A Supervised Learning Approach For Heading Detection

Sahib Singh Budhiraja and Vijay Mago


(sbudhira,vmago)@lakeheadu.ca

Lakehead University, 955 Oliver Rd, Thunder Bay, ON P7B 5E1


arXiv:1809.01477v1 [cs.IR] 31 Aug 2018

Abstract. As the Portable Document Format (PDF) file format increases in popularity, research in analysing its structure for text extraction and analysis is necessary. Detecting headings can be a crucial component of classifying and extracting meaningful data. This research involves training a supervised learning model to detect headings with features carefully selected through recursive feature elimination. The best performing classifier had an accuracy of 96.95%, sensitivity of 0.986 and a specificity of 0.953. This research into heading detection contributes to the field of PDF based text extraction and can be applied to the automation of large scale PDF text analysis in a variety of professional and policy based contexts.

Keywords: Heading Detection · Text Segmentation · Supervised Approach.

1 Introduction
As the amount of information stored within PDF documents increases worldwide, large scale text based analysis requires increasingly automated processes, as manual document processing is time consuming and labour intensive for human professionals. Systematic processing and extraction of textual structure is increasingly necessary and useful, as demonstrated in El-Haj et al.'s work involving 1500 financial statements [7]. Categorizing data into separate sections is quite easy for humans, as they rely on visual cues such as headings to process textual information. Machines, despite being able to process large amounts of information at high speeds, require effort to classify and interpret text based data. This paper explores the application of supervised classifiers to operationalize a system that aids in the identification of headings. PDF documents are visually exact digital copies that display text by drawing characters at specific locations [10]; they present a challenge for analysis because the files do not provide enough information on how the text is organized and formatted.

A supervised classifier trained on labelled data provides one solution to categorizing PDF text, as the labels show the classifier how to make predictions based on the data provided. This research involved comparing and systematically testing a variety of classifiers for the purpose of selecting the classifiers best suited to this application. Recursive feature elimination [8] is used to ensure the classifiers use only the best and the minimum number of features for making predictions. Cross validation is used to tune the hyperparameters of a given machine learning algorithm for increased performance before testing it on test data. The final trained classifier is currently being applied to detect headings in course outline documents and extract learning outcomes. The extracted learning outcomes are being used to automate the process of developing university/college transfer credit agreements by using semantic similarity algorithms [3].

2 Related Work

While the PDF format is convenient as it preserves the structure of a document across platforms, extracting textual layout information is required for detecting headings and further analysis. One solution to extracting layout information is to convert the PDF into HTML and use the HTML tags for further analysis. Once converted to HTML, all the information related to text formatting required for the analysis, like font size and boldness of text, can be easily extracted. A variety of PDF to HTML conversion tools are available and have been assessed based on the text and structural loss associated with each tool [4]. Additional work includes PDF to HTML text detection approaches that maintain layout and font information [2], table detection, extraction and annotation [1] and analysis using white spaces [5]. HTML conversion is clearly a well established approach to analyzing PDF layout and content.

Previous research provides insight into processes related to extracting the heading layout of an HTML document [6]. In Manabe's work, headings are used to divide a document at certain locations that indicate a change in topic. Document Object Model (DOM) trees are used to sort candidate headings based on their significance and to define blocks. A recursive approach is applied for document segmentation using the list of candidate headings and is evaluated with good results on a manually labelled dataset.

El-Haj et al. provide a practical application of document structure detection through the analysis of a large corpus of UK financial reports, including 1500 annual reports from 200 different organizations. A list of 'gold standard' section names was generated from 50 randomly selected reports and used to match the corresponding sections of every document page in the dataset. Section matches were then extracted and evaluated using sensitivity, specificity and F1 score, in addition to being reviewed by a domain expert for accuracy [7].

Current research has taken steps towards systems that analyse a document's textual structure, but there is a need for an approach that can efficiently and accurately analyse the textual layout of a document and divide it into content sections to automate the process of extracting text from PDF documents. We present our supervised learning approach for heading detection as a solution.

3 Methodology

3.1 Data Collection

Our data set consisted of 500 documents1 downloaded from Google using the Google Custom Search API [11]. To extract the corresponding formatting/style information, the documents were converted from PDF to HTML using pdf2txt, a PDFMiner wrapper available in Python [12]. This is illustrated in Fig 1, which shows some sample text and its corresponding HTML tags generated by the conversion process. The final data points, generated by parsing the HTML tags using regular expressions, are also shown in Fig 1. A regular expression is a string of characters used to define a search pattern [13]. The regular expressions used for parsing the tags are as follows:

To extract font size and corresponding text:

r'<\s*?span[^>]*font-size:(\w*)px[^>]*>(.*?)<\/span\b[^>]*>'

To check if text is bold, we look for the following regular expression in the starting tag:

r'[Bb]old'

Each data point contains some text, a font size and a flag which is either 1 or 0 depending on whether the corresponding text is bold. The whole process yielded 83,194 data points, which were then exported into an Excel file for further pre-processing.

1 Repository available at: https://github.com/sahib-s/Heading-Detection-PDF-Files
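The conversion and parsing step can be sketched as follows. This is an illustrative reconstruction, not the authors' code; it assumes the pdf2txt.py command-line tool from PDFMiner [12] is available on the path, and the function and variable names are hypothetical.

import re
import subprocess

def pdf_to_html(pdf_path, html_path):
    # pdf2txt.py is the command-line wrapper shipped with PDFMiner
    subprocess.run(['pdf2txt.py', '-t', 'html', '-o', html_path, pdf_path], check=True)

def parse_spans(html_path):
    # Captures the font size (in px) and the enclosed text of every <span> tag
    span_re = re.compile(r'<\s*?span[^>]*font-size:(\w*)px[^>]*>(.*?)<\/span\b[^>]*>', re.DOTALL)
    bold_re = re.compile(r'[Bb]old')
    with open(html_path, encoding='utf-8') as f:
        html = f.read()
    data_points = []
    for match in span_re.finditer(html):
        opening_tag = match.group(0).split('>', 1)[0]  # inspect the starting tag only
        data_points.append({'text': match.group(2),
                            'font_size': int(match.group(1) or 0),
                            'bold': 1 if bold_re.search(opening_tag) else 0})
    return data_points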

3.2 Data Preprocessing

The process of transforming raw data into usable training data is referred to
as data preprocessing. The steps of data preprocessing for this research are as
follows:

Data Labelling: Data labelling refers to the process of assigning labels to data points, which makes the data suitable for training supervised machine learning models. All 83,914 data points were manually labelled by cross-referring to the documents, as both training and testing data need to be labelled. If the text in a data point was a heading, the label was set to 1, otherwise to 0. Labelling data is one of the most important steps of preprocessing because the performance of the model depends on how well the data is labelled. An example of labelled data points is provided in Fig 1(c).

Fig. 1. Extraction of Data from Documents

Balancing The Dataset: The dataset is considered imbalanced if the prevalence of one class is higher than that of the other. The number of headings in our dataset is far smaller than the number of non-headings, because a document contains far fewer headings than other text elements. Sklearn's implementation of the Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the dataset, which it does by creating synthetic data points for the minority class to make the classes even [18,16]. A sketch of this step is shown below.
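The paper attributes SMOTE to the scikit-learn ecosystem [18,16]; the imbalanced-learn companion package provides the usual implementation, and its API is assumed in this minimal sketch. The variable names are placeholders for the feature matrix and labels built in the next subsection.

from imblearn.over_sampling import SMOTE

# Oversample the minority class (headings) until both classes are even
features_balanced, labels_balanced = SMOTE(random_state=42).fit_resample(features, labels)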

Data Transformation: The process of transforming data into a form that has more predictive value is known as data transformation. Its purpose is to convert raw data points into 'features' that contribute more predictive value in training and in decision making related to heading identification. For example, font size and text are two fields from the raw data which, in their base form, do not have much value but can be transformed into useful features for training an efficient model. The list of transformed data fields is as follows:

– Font Flag: Headings tend to be larger in terms of font size than the paragraph text that follows. Therefore, a higher font size increases the probability that the text is a heading. However, since each document is unique, there cannot be a single threshold applied across all instances.
Thresholds are calculated for each document by measuring the frequency of each font size, where each character with a particular font size is counted as one instance. The font size with the maximum frequency is used as the threshold. This approach relies on the assumption that the most frequently used font size is the one used for the paragraph text, so any font size above it increases the probability of that text being a heading. Fig 2 shows an example where the most frequent font size, 9, belongs to the paragraph text and all text with a larger font size is more likely to be a heading.
Font Flag can take two possible values, 0 and 1. If the font size for a data point is less than the corresponding threshold, the value is set to 0, otherwise it is set to 1.

Fig. 2. Font Size Threshold Assumption Example
– Text: The text is transformed into the following feature variables, which are also listed in Table 1.
• Number of Words: The number of words in the text can be used for training, as headings tend to have fewer words than regular sentences and paragraphs.
• Text Case: Headings mostly use title case, though sometimes they are in upper case as well. This variable indicates whether the text is in upper case (all letters in upper case), lower case (all letters in lower case), title case (first letter of all words in upper case) or sentence case (only the first letter of the text in upper case).
• Features From Parts of Speech (POS) Tagging: POS tagging is the process of assigning parts of speech (verb, adverb, adjective, noun) to each word, which are referred to as tokens. The text from each data point is first tokenized and then each token is assigned a POS label [9].
The POS frequencies provide the model with information on the grammatical aspects of the text, and the frequency of these labels can be exploited to identify headings and contribute to the accuracy of the model. For example, headings tend to have no verbs; some might contain them, but the absence of verbs increases the probability of the text being a heading. All frequency data collected from POS tagging is analysed in the feature selection process to differentiate between useful and irrelevant features. The frequency of each POS tag in the text is calculated for every data point, and these frequencies serve as potential features for the model.

All these features bring the total number of features generated from the text to 11: 9 from POS tagging and 2 from its physical properties. A sketch of this transformation step is shown below.
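An illustrative sketch of the transformation follows, assuming NLTK [9] for tokenization and POS tagging; the mapping from Penn Treebank tags to the coarse categories of Table 1, and the helper names, are assumptions rather than the authors' code.

from collections import Counter
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def font_threshold(char_font_sizes):
    # The most frequent per-character font size is assumed to be the body text size
    return Counter(char_font_sizes).most_common(1)[0][0]

def text_case(text):
    # 0 = lower case, 1 = upper case, 2 = title case, 3 = none of the three
    if text.isupper():
        return 1
    if text.islower():
        return 0
    if text.istitle():
        return 2
    return 3

def to_features(text, font_size, is_bold, threshold):
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]

    def count(prefixes):
        # Count tags whose Penn Treebank code starts with any given prefix
        return sum(tag.startswith(prefixes) for tag in tags)

    return {'Characters': len(text),
            'Words': len(tokens),
            'Text Case': text_case(text),
            'Bold or Not': is_bold,
            'Font Threshold Flag': int(font_size >= threshold),
            'Verbs': count(('VB',)),
            'Nouns': count(('NN',)),
            'Adjectives': count(('JJ',)),
            'Adverbs': count(('RB',)),
            'Pronouns': count(('PRP',)),
            'Cardinal Numbers': count(('CD',)),
            'Coordinating Conjunctions': count(('CC',)),
            'Predeterminers': count(('PDT',)),
            'Interjections': count(('UH',))}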

Table 1. List of all features

All features are integers, except for Bold or Not and Font Threshold Flag, which are binary.

Feature Name                Description
Characters                  Number of characters in the text.
Words                       Number of words in the text.
Text Case                   Assumes the value 0, 1, 2 or 3 depending on the text being in lower case, upper case, title case or none of the three, respectively.
Bold or Not                 Assumes the value 1 or 0 depending on the text being bold or not.
Font Threshold Flag         Assumes the value 1 or 0 depending on the font size of the text being greater than the threshold or not.
Verbs                       Number of verbs in the text.
Nouns                       Number of nouns in the text.
Adjectives                  Number of adjectives in the text.
Adverbs                     Number of adverbs in the text.
Pronouns                    Number of pronouns in the text.
Cardinal Numbers            Number of cardinal numbers in the text.
Coordinating Conjunctions   Number of coordinating conjunctions in the text.
Predeterminers              Number of predeterminers in the text.
Interjections               Number of interjections in the text.

3.3 Feature Selection

After pre-processing, 14 training features are established. There is a need to select the top features for building each individual model with maximum accuracy. Table 1 lists all the features we are choosing from. To achieve this we used Recursive Feature Elimination with Cross-Validation (RFECV), which recursively removes weak attributes/features and uses the model accuracy to identify the features that contribute to the predictive power of the model [8]. The selection process is performed using the machine learning library scikit-learn.

Cross validation is done by making 10 folds in the training set, and one feature is removed per iteration. As per this analysis, the accuracy does not increase when the Decision Tree classifier is trained with more than the following seven features:

– Bold or Not
– Font Threshold Flag
– Number of words
– Text Case
– Verbs
– Nouns
– Cardinal Numbers

The same process is repeated for all the classifiers and their individual sets of chosen features are listed in Table 2. A sketch of this selection step is shown below.
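A sketch of the step with scikit-learn's RFECV follows; traindata and truelabels are placeholders for the preprocessed feature matrix and the manual labels.

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

selector = RFECV(estimator=DecisionTreeClassifier(), step=1, cv=10, scoring='accuracy')
selector = selector.fit(traindata, truelabels)
print(selector.n_features_)   # number of features retained
print(selector.support_)      # boolean mask over the candidate features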

3.4 Grid Search

Each classifier's parameters are tuned for optimal performance using the accuracy from cross validation as a measure. We try various combinations of classifier parameters and choose the combination with the best cross validation accuracy. This process is performed for each classifier to choose its corresponding parameters; a sketch of this step is shown below. The description, along with the final selected tuning parameters for each classifier used in this research, is discussed in the next section.
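An illustrative grid search for the Decision Tree is sketched here; the candidate parameter values are assumptions, not the exact grid used by the authors.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2, 4, 8],
              'min_samples_leaf': [1, 3, 5]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
search.fit(traindata, truelabels)
print(search.best_params_, search.best_score_)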

Table 2. Selected features for each classifier

Decision Tree: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Cardinal Numbers
SVM: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Adjectives, Adverbs
k-Nearest Neighbors: Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives, Cardinal Numbers, Coordinating Conjunctions
Random Forest: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Adverbs, Cardinal Numbers, Coordinating Conjunctions
Gaussian Naive Bayes: Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives, Cardinal Numbers, Coordinating Conjunctions
Quadratic Discriminant Analysis: Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives, Coordinating Conjunctions
Logistic Regression: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Adverbs, Coordinating Conjunctions
Gradient Boosting: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Cardinal Numbers
Neural Net: Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns, Cardinal Numbers

3.5 Training
After the most suitable features and parameters for each classifier have been
selected, we can proceed with training the classifiers using scikit-learn [18].

Decision Tree Decision trees are among the most widely used classifiers, as they have a simple flow-chart like structure starting from a root node. The tree branches off to further nodes, terminating at leaf nodes. At each non-leaf node a decision is made, which selects the branch to follow. The process continues until a leaf node is reached, which contains the corresponding decision [14]. Gini impurity is used as the measure of the quality of a split, which indicates whether the split made the dataset more pure. Using Gini is computationally less expensive than entropy, which involves computing logarithmic functions. The 'best' option for the splitter strategy chooses the best split at each node. The minimum number of samples required to split an internal node is set to 2 and the minimum number of samples needed at a leaf node is set to 3. The code snippet for training this classifier with the chosen parameters is given in Box 1.

Box 1: Code Snippet for Training Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

treeclf = DecisionTreeClassifier(criterion='gini', splitter='best',
                                 min_samples_split=2, min_samples_leaf=3)
treeclf = treeclf.fit(traindata, truelabels)

Support Vector Machine (SVM) An SVM is a classifier that uses multi-dimensional hyperplanes to make classifications. SVM also uses kernel functions to transform the data in such a way that it is feasible for the hyperplane to effectively partition the classes [15]. The kernel used is the radial basis function (rbf), the degree of the polynomial kernel function is set to 3 and gamma is set to 'auto'. The shrinking heuristic is enabled as it speeds up the optimization. The tolerance for the stopping criterion is set to 2e-3 and the 'ovr' (one vs rest) decision function is chosen for the decision function shape. The code snippet for training this classifier with the chosen parameters is given in Box 2.

Box 2: Code Snippet for Training Support Vector Machine Classifier

from sklearn.svm import SVC

svmclf = SVC(kernel='rbf', degree=3, gamma='auto', shrinking=True,
             tol=0.002, decision_function_shape='ovr')
svmclf = svmclf.fit(traindata, truelabels)

k-Nearest Neighbors The main idea behind k-Nearest Neighbors is that it takes into account the classes of a point's neighbors to decide how to classify the data point under consideration. Each neighbor's class is considered as a vote towards that class, and the class with the most votes is assigned to the data point [17]. The number of neighbours used to classify a point is set to 10. Neighbours are weighted by the inverse of their distance, as weights is set to 'distance'. The Minkowski distance function is used as the distance metric. The code snippet for training this classifier with the chosen parameters is given in Box 3.

Box 3: Code Snippet for Training k-Nearest Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier

neighclf = KNeighborsClassifier(n_neighbors=10, weights='distance',
                                metric='minkowski')
neighclf = neighclf.fit(traindata, truelabels)

Random Forest This classifier works by choosing random data points from the training set and creating a set of decision trees. The final decision regarding the class is made by aggregating the outputs of all the trees [19]. The number of trees in the forest is set to 2 and 'gini' is used as the measure of the quality of a split. The maximum depth of the trees is set to 5 and the maximum number of features to be considered while searching for the best split is set to 'auto'. The minimum number of samples required to split an internal node is set to 2 and the minimum number of samples needed at a leaf node is set to 3. The number of parallel jobs to run for both fit and predict is set to 1. The code snippet for training this classifier with the chosen parameters is given in Box 4.

Box 4: Code Snippet for Training Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rndForstclf = RandomForestClassifier(n_estimators=2, criterion='gini',
                                     max_depth=5, max_features='auto',
                                     min_samples_split=2, min_samples_leaf=3,
                                     n_jobs=1)
rndForstclf = rndForstclf.fit(traindata, truelabels)

Gaussian Naive Bayes This classifier works by applying Bayes' theorem with the assumption of strong independence between the predictors (features). It is very useful for large data sets as it is quite simple to build and has no complicated iterative parameters [22]. This classifier does not have many parameters to configure. The prior probabilities of the classes are set to [0.5, 0.5], as the number of headings is small compared to other text. The code snippet for training this classifier with the chosen parameters is given in Box 5.

Box 5: Code Snippet for Training Gaussian Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB

gaussianclf = GaussianNB(priors=[0.5, 0.5])
gaussianclf = gaussianclf.fit(traindata, truelabels)

Quadratic Discriminant Analysis It works under the assumption that the measurements for each class are normally distributed, while not assuming the covariance to be identical for all the classes. Discriminant analysis is used to choose the best predictor variable(s) and is more flexible than linear models, making it better suited to a variety of problems [24]. The prior probabilities of the classes are set to [0.5, 0.5], as the number of headings is far smaller than that of other text. The threshold used for rank estimation is set to 1e-4. The code snippet for training this classifier with the chosen parameters is given in Box 6.

Box 6: Code Snippet for Training Quadratic Discriminant Analysis Classifier

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

quadclf = QuadraticDiscriminantAnalysis(priors=[0.5, 0.5], tol=0.0001)
quadclf = quadclf.fit(traindata, truelabels)

Logistic Regression It is a discriminative classifier, therefore it works by discriminating amongst the different possible values of the classes [23]. The penalization method is set to l2. The tolerance for the stopping criterion is set to 2e-4. The parameter 'fit_intercept' is set to true, adding a constant to the decision function. The optimization solver used is 'liblinear' and the maximum number of iterations taken by the solver is set to 50. Multiclass is set to 'ovr', fitting a binary problem for each label. The number of CPU cores used for parallelizing over classes is set to 1. The code snippet for training this classifier with the chosen parameters is given in Box 7.

Box 7: Code Snippet for Training Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression(penalty='l2', tol=0.0002, fit_intercept=True,
                                  solver='liblinear', max_iter=50,
                                  multi_class='ovr', n_jobs=1)
logisticRegr = logisticRegr.fit(traindata, truelabels)

Gradient Boosting This classification method uses an ensemble of weak prediction models in a stage-wise manner. In each stage, a weak model is introduced to make up for the limitations of the existing weak models [21]. The loss function to be optimized is set to 'deviance' and the learning rate is set to 0.1. The minimum number of samples required to split an internal node is set to 2, the minimum number of samples needed at a leaf node is set to 1 and the maximum depth of the individual regression estimators is set to 3. The number of boosting stages is set to 150 and the measure of the quality of a split is set to 'friedman_mse'. The code snippet for training this classifier with the chosen parameters is given in Box 8.

Box 8: Code Snippet for Training Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

grdbstcf = GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
                                      min_samples_split=2, min_samples_leaf=1,
                                      max_depth=3, n_estimators=150,
                                      subsample=1.0, criterion='friedman_mse')
grdbstcf = grdbstcf.fit(traindata, truelabels)

Neural Net This classifier works by imitating the neural structure of the brain. One data point is processed at a time and the actual classification is compared to the classification made by the classifier. Any errors recorded in the classification process are looped back into the algorithm to improve classification performance in future iterations [27,25]. The classifier is configured to have one hidden layer with 100 units. The activation function used for the hidden layer is 'tanh'. The solver used for weight optimization is 'lbfgs'. The batch size is set to 'auto' and the initial learning rate is set to 0.001. The parameter 'max_iter' is set to 300, which for the 'adam' solver would define the number of epochs. Shuffle is set to true, which enables sample shuffling in each iteration. The exponential decay rates for the estimates of the first and second moment vectors are set to 0.9 and 0.999 respectively. The code snippet for training this classifier with the chosen parameters is given in Box 9.

Box 9: Code Snippet for Training Neural Net Classifier

from sklearn.neural_network import MLPClassifier

nurlntclf = MLPClassifier(hidden_layer_sizes=(100,), activation='tanh',
                          solver='lbfgs', learning_rate='invscaling',
                          batch_size='auto', learning_rate_init=0.001,
                          max_iter=300, shuffle=True, beta_1=0.9,
                          beta_2=0.999)
nurlntclf = nurlntclf.fit(traindata, truelabels)

Table 3. Classifier Accuracy

Highest Value For Each Measure is Bold

Classifier Sensitivity Specificity Precision F1 Score Accuracy


Decision Tree 0.986 0.952 0.953 0.970 96.95 %
SVM 0.991 0.930 0.934 0.961 96.06 %
K-Nearest Neighbors 0.979 0.945 0.946 0.962 96.22 %
Random Forest 0.991 0.928 0.932 0.961 95.99 %
Gaussian Naive Bayes 0.981 0.912 0.917 0.948 94.66 %
Quadratic Discriminant Analysis 0.982 0.912 0.918 0.949 94.76 %
Logistic Regression 0.982 0.904 0.911 0.945 94.34 %
Gradient Boosting 0.991 0.941 0.944 0.967 96.66 %
Neural Net 0.992 0.941 0.944 0.967 96.68 %
4 Test Results

Training and Prediction Time: When dealing with a large number of documents, the time required to train a model and make predictions is important and is dependent on the type of classifier used, the number of features and the number of data points. In this research all classifiers are trained using the same number of features and data points, so the time taken provides a good measure of the variation in training and prediction speed across the classifiers. Of note, the training time for a classifier should be considered in context, as training only needs to be performed once and the trained model can be saved for later use. Therefore, a model that takes a long time to train can still be practical as long as it does not take a lot of time to make predictions. Fig 3 shows the time required for training and making predictions using these classifiers. Each time shown is the average of 10 observations, which reduces the effect of programs running in the background on the comparison.

Fig. 3. Time required to train classifiers and run predictions on test data
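A simple harness of the kind used to produce these timings might look as follows; averaging over ten runs follows the paper, while the code itself and the testdata placeholder are assumptions.

import time

def average_time(fn, repeats=10):
    # Average wall-clock time of fn() over several runs
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        runs.append(time.perf_counter() - start)
    return sum(runs) / repeats

train_time = average_time(lambda: treeclf.fit(traindata, truelabels))
predict_time = average_time(lambda: treeclf.predict(testdata))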

Confusion Matrix Based Evaluation: We use evaluation measures such as sensitivity, specificity, precision, F1 score and net accuracy, calculated from the confusion matrix, to compare the classifiers with each other. Table 3 shows the results of this evaluation.

ROC Curves and AUC: A receiver operating characteristics (ROC) curve is used to visualize the trade-offs between sensitivity and specificity. These graphs are used for performance based selection of classifiers. The graph can be reduced to a numerical measure, AUC (or AUROC), which is the area under the ROC curve, with values ranging from 0 to 1 [26]. Table 4 shows the AUC scores for the classifiers used in this research. The discussion section provides more information on how we used the AUC score to select the best classifier. A sketch of this evaluation is shown below.
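A sketch of the confusion matrix and ROC based evaluation, with y_test, y_pred and y_score standing in for the held-out labels, the predictions and the class probability scores, is:

from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_test, y_score)   # y_score e.g. from predict_proba(...)[:, 1]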

Table 4. AUC Values for all Classifiers

Classifier AUC
Decision Tree 0.98
SVM 0.97
K-Nearest Neighbors 0.96
Random Forest 0.97
Gaussian Naive Bayes 0.96
Quadratic Discriminant Analysis 0.96
Logistic Regression 0.95
Gradient Boosting 0.98
Neural Net 0.98

5 Discussion

We recorded the time (in seconds) required for training each classifier and also the time for making predictions, as shown in Fig 3. The time taken by a classifier to make predictions is important when processing documents in bulk, as it can increase the processing time. Training a classifier only has to be done once, so training time is given less importance. The Decision Tree classifier took the least time for training while Gradient Boosting took the most. Comparing the prediction times, Logistic Regression takes the least time and Random Forest takes the most. While prediction time is not the most important factor when choosing a classifier, we take it into consideration when two classifiers perform approximately the same.

The top three classifiers based on net accuracy are Decision Tree, Gradient Boosting, and Neural Network; however, classifier selection cannot rely solely on accuracy [28,29]. Therefore, we also weigh metrics like AUC, F1 score, sensitivity, and specificity to choose the classifier best suited for detecting headings.

The top three classifiers in terms of F1 score, precision, sensitivity and specificity are Decision Tree, Gradient Boosting, and Neural Network, and the top three in terms of AUC, as shown in Table 4, are again Decision Tree, Gradient Boosting, and Neural Network. The system is going to be dealing with documents in bulk, and the prediction time for Decision Tree is better than that of both Gradient Boosting and Neural Network. Therefore, we choose our configuration of the Decision Tree for making the classifications.

6 Testing The Generalizability

Testing the chosen classifier on a general set of documents is important to show that it performs well on documents other than course outlines. We tested the chosen Decision Tree classifier on 12,919 data points collected from documents such as reports and articles2. These data points were manually tagged using a survey. All the participants were graduate students from the computer science department and were asked to point out headings and subheadings in the documents. Table 5 shows the results, which are equivalent to, if not better than, those obtained on course outlines.

2 Repository available at: https://github.com/sahib-s/Generalizability/

Table 5. Test Results For General Set

Category Value
Total Data points 12919
Sensitivity 0.928
Specificity 0.966
Precision 0.964
F1 Score 0.946
Accuracy 94.73 %
AUC 0.97

Table 6. Pearson Correlation Coefficient Between Each Feature Used in the Selected
Classifier and Final Decision Labels

Feature Name Pearson Correlation Coefficient


Bold or Not 0.7022
Font Threshold Flag 0.2385
Words 0.1389
Verbs 0.1229
Nouns 0.1207
Cardinal Numbers 0.1201
Text Case 0.0660

7 Analysing The Results

The discussed configuration of the Decision Tree is best suited to detect headings, as discussed in Section 5. Analyzing the contribution of each feature towards the final decision made by the classifier is also important to understand the implications of the results. Table 6 shows the Pearson correlation coefficient between each of the features used in the selected classifier and the final decision label. The list is in descending order of Pearson correlation coefficient, so the top feature in the table contributes the most towards the final decision. Each feature was also removed from the classifier one at a time, and the resulting drop in evaluation metrics verifies the order of contribution indicated by the Pearson correlation coefficients. The top three contributing features are therefore the ones that rely on the physical attributes of the text. A sketch of this correlation analysis is shown below.
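A sketch of the correlation analysis, assuming the selected features are held in a pandas DataFrame named feature_frame and that selected_feature_names lists the seven features of Table 6, is:

from scipy.stats import pearsonr

for name in selected_feature_names:
    r, _ = pearsonr(feature_frame[name], truelabels)
    print(name, round(r, 4))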

8 Extending The Classifier


The extension of this work includes tagging multiple labels like heading, paragraph text, header/footer text and table text. While classifying paragraph text is possible using the existing features, properly classifying table text and header/footer text requires more data features. We are currently looking at features from our white space detection approach discussed in chapter 5 and at bounding box data from PDF to XML conversion to provide the model with what it needs to make this classification.

9 Conclusion
This research has provided a structured methodology and systematic evaluation of a heading detection system for PDF documents. The detected headings provide information on how the text is structured in a document. This structural information is used for extracting specific text from these documents based on the requirements of the field of application. This supervised learning approach has demonstrated good results, and we are currently applying our configuration of the Decision Tree classifier in the field of post-secondary curriculum analysis to identify headings and extract learning outcomes from course outlines for research being conducted at DATALAB, Lakehead University, Canada.

Acknowledgment
This research would not have been possible without the financial support provided by the Ontario Council on Articulation and Transfer (ONCAT) through Project Number 2017-17-LU. We would also like to express our gratitude towards the datalab.science team and Andrew Heppner for their support.

References
1. Khusro, Shah, Asima Latif, and Irfan Ullah. ”On methods and tools of table detec-
tion, extraction and annotation in PDF documents.” Journal of Information Science
41.1 2015.: 41-57.
2. Jiang, Deliang, and Xiaohu Yang. ”Converting PDF to HTML approach based on
Text Detection.” Proceedings of the 2nd International Conference on Interaction
Sciences: Information Technology, Culture and Human. ACM, 2009.

3. Lakehead University. Learning Objective Automated Gap Analysis. 2018, www.loaga.science/.
4. Goslin, Kyle, and Markus Hofmann. ”Cross Domain Assessment of Document to
HTML Conversion Tools to Quantify Text and Structural Loss during Document
Analysis.” Intelligence and Security Informatics Conference (EISIC), 2013 Euro-
pean. IEEE, 2013.
5. Rahman, Fuad, and Hassan Alam. ”Conversion of PDF documents into HTML: a
case study of document image analysis.” Signals, Systems and Computers, 2004.
Conference Record of the Thirty-Seventh Asilomar Conference on. Vol. 1. IEEE,
2003.
6. Manabe, Tomohiro, and Keishi Tajima. ”Extracting logical hierarchical structure of
HTML documents based on headings.” Proceedings of the VLDB Endowment 8.12
2015.: 1606-1617.
7. El-Haj, Mahmoud, et al. ”Detecting document structure in a very large corpus of
UK financial reports.” 2014.: 1335-1338.
8. Guyon, I., Weston, J., Barnhill, S., et al. "Gene selection for cancer classification using support vector machines." Machine Learning 46 (2002): 389-422.
9. Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with
Python: analyzing text with the natural language toolkit. ” O’Reilly Media, Inc.”,
2009.
10. Bienz, Tim, Richard Cohn, and Adobe Systems (Mountain View, Calif.). Portable
document format reference manual. Reading, MA, USA: Addison-Wesley, 1993.
11. Google. Custom Google Search API. Google, Google, 2018,
developers.google.com/custom-search/json-api/v1/overview.
12. Shinyama, Yusuke. ”PDFMiner: Python PDF parser and analyzer.”
13. Goyvaerts, Jan, and Steven Levithan. Regular expressions cookbook. O’reilly, 2012.
14. Lior, Rokach. Data mining with decision trees: theory and applications. Vol. 81.
World scientific, 2014.
15. Ben-Hur, Asa, and Jason Weston. ”A users guide to support vector machines.”
Data mining techniques for the life sciences. Humana Press, 2010. 223-239.
16. Chawla, Nitesh V., et al. ”SMOTE: synthetic minority over-sampling technique.”
Journal of artificial intelligence research 16 (2002): 321-357.
17. D. T. Larose, ”k-nearest neighbor algorithm” in Discovering Knowledge in Data:
An Introduction to Data Mining, Hoboken, NJ, USA:Wiley, pp. 90-106, 2005.
18. Pedregosa, Fabian, et al. ”Scikit-learn: Machine learning in Python.” Journal of
machine learning research 12.Oct (2011): 2825-2830.
19. Louppe, Gilles. ”Understanding random forests: From theory to practice.” arXiv
preprint arXiv:1407.7502 2014.
20. Parikh, Rajul, et al. ”Understanding and using sensitivity, specificity and predictive
values.” Indian journal of ophthalmology 56.1 2008.: 45.
21. Ridgeway, Greg. ”The state of boosting.” Computing Science and Statistics 1999.:
172-181.
22. Rish, Irina. ”An empirical study of the naive Bayes classifier.” IJCAI 2001 work-
shop on empirical methods in artificial intelligence. Vol. 3. No. 22. IBM, 2001.
23. Pohar, Maja, Mateja Blas, and Sandra Turk. ”Comparison of logistic regression
and linear discriminant analysis: a simulation study.” Metodoloski zvezki 1.1 2004.:
143.
24. Sueyoshi, Toshiyuki. ”DEA-discriminant analysis: methodological comparison
among eight discriminant analysis approaches.” European Journal of Operational
Research 169.1 2006.: 247-272.
25. Graupe, Daniel. Principles of artificial neural networks. Vol. 7. World Scientific,
2013.

26. Fawcett, Tom. ”An introduction to ROC analysis.” Pattern recognition letters 27.8
2006.: 861-874.
27. Mago, Vijay Kumar, ed. Cross-Disciplinary Applications of Artificial Intelligence
and Pattern Recognition: Advancing Technologies: Advancing Technologies. IGI
Global, 2011.
28. Huang, Jin, and Charles X. Ling. ”Using AUC and accuracy in evaluating learning
algorithms.” IEEE Transactions on knowledge and Data Engineering 17.3 2005.:
299-310.
29. Ling, Charles X., Jin Huang, and Harry Zhang. ”AUC: a better measure than
accuracy in comparing learning algorithms.” Conference of the canadian society for
computational studies of intelligence. Springer, Berlin, Heidelberg, 2003.
A snapshot on nonstandard supervised learning problems
Taxonomy, relationships and methods

David Charte · Francisco Charte · Salvador García · Francisco Herrera

D. Charte, Universidad de Granada, Granada, Spain. E-mail: fdavidcl@ugr.es
F. Charte, Universidad de Jaén, Jaén, Spain. E-mail: fcharte@ujaen.es
S. García, Universidad de Granada, Granada, Spain. E-mail: salvagl@decsai.ugr.es
F. Herrera, Universidad de Granada, Granada, Spain. E-mail: herrera@decsai.ugr.es

arXiv:1811.12044v1 [cs.LG] 29 Nov 2018

This is a pre-print of an article published in Progress in Artificial Intelligence. The final authenticated version is available online at: https://doi.org/10.1007/s13748-018-00167-7

Abstract Machine learning is a field which studies how machines can alter and adapt their behavior, improving their actions according to the information they are given. This field is subdivided into multiple areas, among which the best known are supervised learning (e.g. classification and regression) and unsupervised learning (e.g. clustering and association rules).
Within supervised learning, most studies and research are focused on well known standard tasks, such as binary classification, multiclass classification and regression with one dependent variable. However, there are many other less known problems. These are what we generically call nonstandard supervised learning problems. The literature about them is much more sparse, and each study is directed to a specific task. Therefore, the definitions, relations and applications of this kind of learners are hard to find.
The goal of this paper is to provide the reader with a broad view on the distinct variations of nonstandard supervised problems. A comprehensive taxonomy summarizing their traits is proposed. A review of the common approaches followed to accomplish them and their main applications is provided as well.

Keywords Machine learning · Supervised learning · Nonstandard learning

Mathematics Subject Classification (2010) MSC 68T05 · MSC 68T10

1 Introduction

According to Mitchell [80], a machine is said to learn from experience E related to a class of tasks T and performance metric P, when its performance at tasks in T improves according to P after experience E.
Supervised learning is one of the fundamental areas of machine learning [78]. From object detection to ecological modeling to emotion recognition, it covers all kinds of applications. It essentially consists in learning a function by training with a set of input-output pairs. The training stage can be seen as E in the previous definition, and the specific task T may vary, but usually involves predicting an appropriate output given a new input.
Traditionally, supervised learning problems have been spread into two categories: classification and regression [43, 60]. In the first, information is divided into discrete categories, while the latter involves patterns associated to a value in a continuous spectrum.
These problems can be processed by learning from a training dataset, which is composed of instances. Typically, these instances or samples take the form (x, y) where x is a vector of values in the space of input variables and y is a value in the target variable. Each problem can be described by the type of its instances: inputs will usually belong to a subset of R^n, and outputs will take values in a specific one-dimensional set, finite or continuous. Once trained, the obtained model can be used to predict the target variable on unseen instances.

Standard classification problems are those where labels are either binary or multiclass [33, 105]. In the binary case, an instance can only be associated with one of two values: positive or negative, which is equivalent to 0 or 1. For example, email messages may be classified into spam or legit, and tumours can be categorized as either benign or malign. Multiclass problems, on the other hand, involve any finite number of classes. That is, any given instance will belong to one of possibly many categories, which is equivalent to it being assigned a natural number below a convenient threshold. As an example, a photograph of a plant or a sound recording from an animal could correspond to one of a variety of species.
A standard regression problem [61, 99] consists in finding a function which is able to predict, for a given example, a real value among a continuous range, usually an interval or the set of real numbers R. For example, the height of a person may be estimated out of several characteristics such as age or country of origin.
Even though these standard problems are applicable in a multitude of cases, there are situations whose correct modeling requires modifications of their structure. For example, a newspaper article can be categorized according to its contents, but it could be desirable to assign several categories simultaneously. Similarly, a social media post could be described by not one but two input vectors, an image and a piece of text. These special circumstances cannot be covered by the traditional one-vector input and one-dimensional output schema. As a consequence, since performance metrics which measure improvements in standard tasks assume the common structure, they lose applicability or sense in these cases. Thus, not only new techniques are needed to tackle the problems, but also new ways of measuring and comparing their success.
This work studies variations on classic supervised problems where the traditional structure is not obeyed, which we call nonstandard variations. These emerge when the structure of the classical components of the problems does not suffice to describe complex situations, such as multiplicity of inputs or outputs, or order restrictions. As a consequence, this manuscript does not cover other singular supervised problems, such as high dimensionality of the feature space [10] or unbalanced training sets [40, 67], nor time-dependent problems, such as data streams [46, 98] or time series [58].
The rest of the paper is structured as follows. Section 2 formally defines and describes each nonstandard variation. This is followed by Section 3 establishing relations among the introduced problems and proposing a taxonomy of them. Section 4 describes the most common techniques used to solve them. After that, Section 5 enumerates popular applications of each problem. Section 6 covers other variations further from the ones previously detailed. Lastly, Section 7 draws some conclusions.

2 Definitions of nonstandard variations

The problems introduced in this section are generalizations over the traditional versions of classification and regression. The focus is on fully supervised problems, where inputs are always paired with outputs during training. An alternative taxonomy based on different supervision models is introduced in [54].

2.1 Notation

In this work we will establish a notation which intends to be as simple to understand as possible, while being able to encompass every nonstandard variation. First, any supervised learning problem consists in finding a function which will classify, rank or perform regression. It will be noted as

f : X → Y    (1)

where X is an input set, or domain, and Y is an output set, or codomain. It will be assumed that a training dataset S is provided, including a finite number of input-output pairs:

(x, y) ∈ S ⊂ X × Y.    (2)

This way, a learning algorithm will be able to generate the desired function f. An additional notation will be the set of labels L where convenient.
For example, in standard binary classification X ⊂ R^n and Y = L = {0, 1}. Similarly, standard regression problems can be defined with the same kind of X set and Y ⊂ R. Thus, we can define very distinct supervised problems by particularizing sets X or Y in different ways.
Other usual notations are based in probability theory, thus involving random variables and probability distributions [115, 83]. In that case, X and Y would be the sample spaces of the input and output variables X and Y, respectively. Predictors would usually attempt to infer a discriminant model P(Y|X) from the training dataset.

2.2 Multi-instance

The multi-instance (MI) framework [56] assumes a single feature space for all instances, but each training pattern may consist of more than one instance. In this case, a training pattern is composed of a finite multiset or bag of instances and a label. Formally, assuming instances are drawn from a set A ⊂ R^n, the domain can be described as follows:

X = {b ⊂ A | b finite}.    (3)

In this case, the learning algorithm will not know labels associated to each instance but to a bag of them. In addition to this, not all instances may share the same relevance or are equally related to the label.
Some MI problems assume that hidden labels are present for each instance in a bag: for example, a training set of drug tests where, for each test, several drug types are analyzed. Additionally, a typical MI assumption in the binary scenario states that a bag is positive when at least one of its instances is positive, and it is negative otherwise [41].
Other MI problems differ in that a per-instance labeling may not be possible or may not make sense: for example, if each bag represents an image and instances are image segments, class beach can only apply to bags with water and sand segments, but it cannot apply to an individual instance.

2.3 Multi-view

A learning problem is considered to be multi-view (MV) [120] when inputs are composed of several components of very different nature.
For example, if a learning pattern consists of an image as well as a piece of text representing the same instance, they can be seen as two views on it. In that case, images and texts would belong to distinct feature spaces A and B respectively, an input pattern being (a, b) ∈ A × B. More generally, we can describe the input space as:

X = ∏_{i=1}^{t} A_i, where A_i ⊂ R^{n_i},    (4)

where t is the number of views offered by the problem and n_i is the dimension of the feature space of the i-th view.

2.4 Multi-label

The multi-label (ML) learning field [55, 48] studies problems related to simultaneously assigning multiple labels to a single instance. That is, if L = {l_1, . . . , l_p} the codomain consists of all possible selections of these p labels, also known as labelsets:

Y = 2^L ≅ {0, 1}^p.    (5)

As shown by this formulation, it is equivalent to think of a selection of labels as a subset of L and as a binary vector. For example, the labelset composed of the first and third labels can be represented either by {l_1, l_3} or (1, 0, 1, 0, . . . , 0).
The difference that arises when comparing ML problems to binary or multiclass ones is that labels may interact with each other. For example, a news piece classified in economy is more likely to be labeled politics than sports. Similarly, a photograph labeled ocean is less likely to have the mountains label rather than beach. Methods may take advantage of label co-occurrence [18] in order to reduce the search space when predicting a labelset.

2.5 Multi-dimensional

Multi-dimensional (MD) learning [96] is a generalized classification problem where categorization is performed simultaneously along several dimensions. Each instance can belong to one of many classes in each dimension, thus the output space can be formally described as:

Y = L_1 × L_2 × · · · × L_p,    (6)

where L_i is the label space for the i-th dimension.
As with ML learning, label dimensions may be related in some way and treating them independently would only be a naive solution to the problem.

2.6 Label distribution learning

In label distribution learning (LDL) problems [47], otherwise known as probabilistic class label problems [75], any instance can be described in different degrees by each label. This can be modeled as a discrete distribution over the labels, where the probability of a label given a specific instance is called its degree of description. Analytically, the objective is, for each instance, to predict a real-valued vector which sums exactly 1:

Y = { y ∈ [0, 1]^p : ∑_{i=1}^{p} y_i = 1 }.    (7)

In this case, we would say that the i-th label in L describes an instance (x, y) with degree y_i.

2.7 Label ranking

In a label ranking (LR) problem [57, 113] the objective is not to find a function able to choose one or several labels from the label space. Instead, it must evaluate their relevance for each unseen instance. The most general version of the problem involves a training set where Y is the set of all partial orders of L, and the obtained function also maps individual instances to partial orders. This way, for each test instance the function will output a sequence of preferences where some labels will be seen as more relevant than others.
However, the typical situation in label ranking problems is that the orders are total, which means any two labels can always be compared. This is called a ranking and does not exclude the possibility of ties.

When ties are not allowed it is said to be a sorting or permutation, and can be formulated as follows:

Y = {σ : {1, . . . , p} → L | σ is bijective},    (8)

where p is the amount of labels. Y can also be seen as the set of all permutations of the labels in L, usually known as the symmetric group of order p, and noted as S_p.

2.8 Multi-target regression

A regression problem where the output space has more than just one dimension is usually called multi-target regression (MTR) and is also known as multi-output, multi-variate or multi-response [11]. In this case, a formal description is simply that the codomain is a continuous multi-dimensional real set:

Y = ∏_{i=1}^{p} Y_i, where Y_i ⊂ R ∀i,    (9)

and p is the number of target variables.
As with other multiple target extensions, the key difference with single-target regression in this case is the possible interactions among output variables.

2.9 Ordinal regression

A problem where the target space is discrete but ordered is called ordinal regression (OR) or, alternatively, ordinal classification [52]. It can be located midway between classification and regression. More specifically, it consists in labeling instances with a finite number of choices where these are ordered

Y = {1, 2, . . . , c}, 1 < 2 < · · · < c.    (10)

In OR, the training phase consists in learning from a set of feature vectors which have a specific label associated to them, and testing can be performed over individual instances. This means that, although labels are ordered, the main objective is not to rank or sort instances as in learning to rank [13], but to simply classify them. The labels themselves do not provide any metric information either, they only carry qualitative information about the order among themselves.

2.10 Monotonicity constraints

Order relations can exist not only in the label space but in the feature space as well. Partial orders among real-valued feature vectors are always possible, and there may be cases where the order among instances is determined by just one or a few of their attributes.
When inputs as well as outputs are at least partially ordered, it is common to look for predictions which respect their order relations. In that case, the objective is to obtain a classifier or regression function which enforces the following constraint:

x_1 < x_2 ⇒ f(x_1) < f(x_2)  ∀ x_1, x_2 ∈ X.    (11)

When Y is discrete the problem is usually called monotone classification (MC), monotonic classification or ordinal classification with monotonicity constraints [51]. If, on the contrary, Y is continuous, it is known as isotonic regression (IR) [6].

2.11 Absence or partiality of information

Some problems do not directly alter the structure of X and Y from the standard supervised problem. Instead, they restrict which data can belong to a training set, or remove labelings from training examples. In this case, training information is presented partially or with some exclusions.
According to which kind of information is missing from the training set, a learning task can usually be categorized as semi-supervised [16], one-class learning [81], PU-learning [37], zero-shot learning [86] or one-shot learning [39]. These are described further in Section 6.1.

2.12 Variation combinations

Some of the components described above can be combined to compose a more complex problem overall. Usually, one of these combinations will take components from different variation types, for example, simultaneous multiplicity of inputs and outputs.
More specifically, there exist several studies involving MI ML scenarios [122, 103]. In this case, examples from the input space are composed of several feature vectors and are associated to various labels. As a consequence, this model can represent many complicated problems where inputs and outputs have more structure than usual.
Other more uncommon situations are MV MI ML problems [84], where patterns have several instances which may or may not belong to the same space, a multi-output version of OR named graded ML classification [22] and more complex input structures such as multi-layer MI MV [116], where a hierarchy of instances is present in each example.
3 Taxonomy

A first categorization of the variations analyzed in this work can be made according to how they differ from the standard problem. There can be multiplicity in the input space or the output space, order constraints may exist, or only partial information may be given in some cases. Fig. 1 shows ways in which the traditional problems can be generalized.

Fig. 1 Extensions of the standard supervised problem: multiple inputs or outputs, presence of orders and rankings, and partial information.

Problems introducing multiple inputs are MI and MV, whereas multiple outputs can be found on ML, MD, LR, LDL and MTR. Problems where orders are present are OR, MC and IR. Likewise, tasks with only partial information are, among others, semi-supervised learning, one-shot classification and zero-shot classification.
Finally, a generalized problem can be built out of combining several of these components: for example, a multiple-input multiple-output problem where the inputs and outputs can belong to structures like the ones defined above.
The rest of this section studies variations on the structure of the input space and output space, establishes relations among problems, and describes how they can be particularized or generalized to one another.

3.1 Input structure

In a standard supervised problem, the input space consists of single feature vectors and does not impose a specific order.
Problems where learning patterns are composed of multiple instances can usually be categorized into either MI, if the inputs share the same structure, or MV, otherwise. Their combination can also be considered as well, e.g. a problem where an example is composed of one or more photographs and one or more pieces of text. This would be a case of a MV MI problem.
There are also problems where there exists a partial or total order among instances, which is coupled with an order constraint in relation to the outputs. These are MC and IR.
Fig. 2 summarizes these structural traits in a hierarchy and indicates problems where these traits are present.

Fig. 2 Traits that can be found on the input structure of supervised problems.

3.2 Output structure

The diversity in output variations is higher than that of the input ones. A first sorting criterion is whether the codomain is discrete or continuous. This way, problems are either classification or regression ones.
Further subdivision of problems allows to separate these traits according to whether outputs remain scalars or become vectors. In the first case we consider order in the discrete scenario a nonstandard variation, which is present in OR and MC. In the second case, classification problems are spread into ML, LR and MD, and regression ones into LDL and MTR.
Fig. 3 organizes these traits in a hierarchy based on the previous criteria. Each leaf of the tree also includes problems where each one is present.
The variations in the structure of target spaces in supervised problems can be seen as generalizations of the standard problems. Furthermore, some of them are also more general than others. For example, ML problems can be seen as LR ones where, for a given instance, labels over a threshold are active and those below are not. Thus, LR is a generalization of the ML scenario. More relations of this kind are displayed in Fig. 4.
As shown in the graph, an inclusion of more target variables of the same type transforms a binary problem into ML, a multiclass problem into MD and a single-target regression one into MTR.
6 David Charte et al.

Output structure traits

Discrete Continuous

Scalar (standard
Scalar Multiple Multiple
regression)

Unordered
Ordered Distribution Unrestricted
(standard Binary (ML) Ranking (LR) Finite (MD)
(OR, MC) (LDL) (MTR)
classification)

Fig. 3 Traits that can be found on the output structure of supervised problems.

variable allows to generalize binary problems to multiclass, In the following subsections several methods based on
and ordinal to single-target regression, as well as ML ones both approaches are enumerated for each analysed problem.
to MD and these to MTR. LDL can be seen as a general-
ization of ML where real numbers between 0 and 1 are also
allowed as values for a label. LR is a generalization of ML
4.1 Problem transformation
by the argument discussed before.
Problem transformation methods assume that a solution can
Multi-label Binary Multiclass be achieved by extracting one or more simpler problems
out of the original one. For example, a problem with multi-
dimensional targets could be transformed into many prob-
lems with scalar outputs. Then, these problems could be
Label ranking Multi-dimensional Ordinal
solved independently by a classical algorithm. A solution
for the original problem would be the concatenation of those
extracted from the simpler ones.
Label distribution Multi-target Standard Next, the most common transformation techniques are
learning regression regression
described for each nonstandard supervised learning task pre-
Fig. 4 Relations among supervised problems according to output viously introduced.
structure. Arrows follow natural generalizations from one problem to
another. Continuous arrows denote generalizations based on adding
more variables of the same type. Dashed arrows indicate generaliza-
tions based on modifying existing target variables.
– MI. The taxonomy proposed in [3] describes an Embed-
ded Space paradigm, where each bag is transformed into
a single feature vector representing the relevant informa-
tion about the whole bag. This transformation brings the MI
problem into a single-instance one. Most of these methods
3.3 Summary are vocabulary-based, which means that the embedding uses
a set of concepts to classify each bag according to its in-
In this section input and output variations of standard super- stances, resulting in a single vector with one component per
vised problems have been categorized and related. Table 1 concept.
allows to identify specific problems according to which in-
put and output traits are present.
– MV. Some naive transformations consist in ignoring every
view except one, or concatenating feature vectors from all
4 Common approaches to tackle nonstandard problems views, thus training a single-view model in both cases [68].
A preprocessing based on Canonical Correlation Analysis
When tackling a nonstandard problem, most techniques fol- [19] is able to project data from multiple views onto a lower-
low one of two main approaches: problem transformation dimensional, single-view space.
or algorithm adaptation. The first one relies on appropriate
transformations of the data which result in one or more sim-
pler, standard problems. The latter implies an extension or – ML. Transformation methods for ML classification [118]
development of previously existing algorithms, in order to are diverse: Binary Relevance trains separate binary classi-
adapt them to the complexities induced by the structure of fiers for each label. Label Powerset reduces the problem to
the data. a multiclass one by treating each individual labelset as an
A snapshot on nonstandard supervised learning problems 7

aa Unordered outputs Ordered outputs


aOutputs Scalar Multiple Scalar Multiple
Inputsaa Discrete Continuous Discrete Continuous
a
Unordered inputs standard classification [43] ML/MD classification [55, 96] OR [52] standard regression [99] Graded ML [22] MTR [11]
Ordered inputs - - MC [51] IR [6] - -
Multiple instances MI classification [56] MIML/MIMD classification [122] - MI regression [56] - -
Multiple views MV classification [120] MVML/MVMD classification [84] - MV regression [120] - -

Table 1 Identification of problems according to their input traits (vertical axis) and output traits (horizontal axis).

independent class label, and Random k-Labelsets [108] ex- – OR. An ordinal problem with c classes can be transformed
tracts an ensemble of multiclass problems similarly. Classi- into c − 1 binary classification problems by using each class
fier chains [91] trains subsequent binary classifiers accumu- from the second to the last one as a threshold for the pos-
lating previous predictions as inputs. ML problems can also itive class [42]. This decomposition can be called ordered
be transformed to LR [44]. partitions and is not the only possible one: others are one-
vs-next, one-vs-followers and one-vs-previous [52]. Several
– MD. In some cases, independent classifiers can be trained 3-class problems can also be obtained by using, for the i-th
for several dimensions [96, 87] but this method ignores pos- problem, classes “li ”, “< li ” and “> li ”.
sible correlations among dimensions. An alternative trans-
formation, building a different label from each combination – MC. The authors in [65] describe a procedure to tackle
of classes, would produce a much larger label space and thus binary MC problems by means of IR. Multiclass MC cases
is not typically applied. can be reduced to several binary MC ones, which in turn are
solved as IR problems.
– LDL. A LDL problem can be reduced to multiclass clas-
sification by extracting as many single-label examples as la-
4.2 Algorithm adaptation
bels for each one of the training instances [47]. These new
examples are assigned a class corresponding to each label Existing methods for classical problems can be extended in
and weighted according to its degree of description. During order to introduce the necessary complexities of nonstan-
the prediction process, the classifier must be able to output dard variations. As an example, nearest neighbor methods
the score/confidence for each label, which can be used as its could be coupled with new distance metrics in order to be
description degree. able to measure similarity among multiple inputs.
The rest of this section presents some algorithm adap-
– LR. A reduction of this problem to several binary prob- tations which can be used to tackle nonstandard supervised
lems can be achieved by learning pairwise preferences [57]. tasks.
This transforms a c-label problem into c(c − 1)/2 binary
problems describing a comparison among two labels. An – MI. Methods that work on instance level are adaptations
alternative reduction by means of constraint classification of algorithms from single-instance classification whose re-
[53] builds a single binary classification dataset by expand- sponses are then aggregated to build the bag-level classifi-
ing each label preference into a new positive instance and a cation [3]. They typically assume that one positive instance
new negative instance. The feature space of the new binary implies a positive bag. Adaptations of common algorithms
problem has dimension nc, where n is the original dimension have been proposed with support vector machines (SVM)
and c the number of labels, due to the constraints embedded [4] and neural networks [90], whereas some original meth-
in it by Kesler’s construction [85]. ods in this area are Axis-Parallel Rectangles [31] and Di-
verse Density [77]. In the bag-space paradigm, methods treat
– MTR. There are several ways to transform a MTR prob- bags as a whole and use specific distance metrics with dis-
lem into several single-target regression ones. Some of them tance as well as kernel-based classifiers, such as k-nearest
are inspired by the ML field, such as a one-vs-all single- neighbor (k-NN) [114] or SVM [121].
target reduction, multi-target stacking and regressor chains
[101]. All of them train single-target regressors for several – MV. Supervised methods for MV are comparatively less
extracted problems, and then combine the obtained predic- developed than semi-supervised ones. Nonetheless, there is
tions. A different approach based on support vectors [119] an extension of SVM [38] which simultaneously looks for
extends the feature space which expresses the multi-output two SVMs, one in each of the feature spaces of a two-view
problem as a single-target one that can be solved using least problem. There is an extension of Fisher discriminant anal-
squares support vector regression machines. ysis as well [20].
8 David Charte et al.

– ML. The most relevant algorithm adaptations [118] are Task Problem transformation Algorithm adaptation
based on standard classification algorithms with added sup- MI Embedded-space [3] SVM [4, 121]
port for choosing more than one class at a time: adaptations Neural networks [90]
k-NN [114]
exist for k-NN [117], decision trees [24], SVMs [36], asso-
MV Canonical correlation analysis [19] SVM [38]
ciation rules [106] and ensembles [82]. Fisher discriminant analysis [20]
ML Binary Relevance [118] k-NN [117]
– MD. Specific Bayesian networks have been proposed for Label Powerset [118] Decision trees [24]
Classifier chains [91] SVM [36]
the MD scenario [8, 26], as well as Maximum Entropy-based Association rules [106]
Ensembles [82]
algorithms [96, 87].
MD Independent classifiers [96, 87] Bayesian networks [8, 26]
Maximum Entropy [96, 87]
– LDL. Proposals in [47] are adaptations of k-NN, with a LDL Multiclass reduction [47] k-NN [47]
special derivation of the label distribution of an unseen in- Neural networks [47]
LR Pairwise preferences [57] Boosting [28]
stance given its neighbors, and backpropagated neural net- Constraint classification [53] SVM [113]
works, where the output layer indicates the label distribution Perceptron [95]
of an instance. Other proposed methods are based on the op- MTR ML [101] Generalizations [59, 111]
Support vectors [119] Support vector regression [112,93]
timization algorithms BFGS and Improved Iterative Scaling. Kernel-based [79, 1]
Regression trees [27]
Random forests [64]
– LR. Boosting methods have been adapted to LR [28], as OR Ordered partitions [42] Neural networks [25, 21]
well as the SVM proposed in [36] for ML which can be natu- One-vs-next, One-vs-followers, Extreme learning machines [30,94]
One-vs-previous [52] Decision trees [14]
rally extended to LR [113]. An adaptation of online learning 3-class problems [52] Gaussian processes [23]
algorithms such as the perceptron has also been developed AdaBoost [73]

[95]. MC Reduction to IR [65] k-NN [34]


Decision trees [89]
Decision rules [29, 9]
Neural networks [97]
– MTR. First methods able to treat MTR problems were ac-
tually generalizations of statistical methods for single-target Table 2 Summary table of presented methods according to their type
regression [59, 111]. Other common methods which have of approach.
been extended to predict multiple regression variables are
support vector regression [112, 93], kernel-based methods
[79,1], and regression trees [27] as well as random forests – MI. Problems modeled under MI learning are drug activ-
[64]. ity prediction [31], where each pattern describes a molecule
and its different forms are represented by instances; image
– OR. Neural networks can be used to tackle OR with slight classification [3], and bankruptcy [66]. Most of the datasets
changes in the loss function or the output layer [25, 21]. Sim- used in experimentations, however, are usually synthetic.
ilarly, extreme learning machines have also been applied to
this problem [30, 94]. Common techniques such as k-NN or
decision trees have been coupled with global constraints for – MV. Some situations where data is described in multiple
OR [14], and extensions of other well known algorithms views are multilingual text categorization [2], face detection
such as Gaussian processes [23] and AdaBoost [73] have with several poses [72], user localization in a WiFi network
been proposed as well. [88], advertisements described by their image and surround-
ing text [102] and image classification with several color-
– MC. Algorithm adaptations generally take a well known based views and texture-based views [110].
technique and add monotonicity constraints. For example,
there exist in the literature adaptations of k-NN [34], de-
cision trees [89], decision rules [29, 9] and artificial neural – ML. Problems which fall naturally under the ML defini-
networks [97]. tion are text classification under several categories simul-
taneously [62], image labeling [12], question tagging in fo-
Table 2 gathers all the methods described previously to rums where tags can co-exist [17], protein classification [32].
tackle nonstandard supervised tasks.

– MD. Applications of MD classification include classifi-


5 Applications. Original real word scenarios cation of biomedical text [96], where predicted dimensions
for a given document are its focus, evidence type, certainty
The problems studied in this work have their origins in real- level, polarity and trend; gene function identification [8]; tu-
world scenarios which are related below: mor classification, and illness diagnosis in animals [26].
A snapshot on nonstandard supervised learning problems 9

– LR. The field known as preference learning has been gain- A different scenario arises when the training set only
ing interest [57], and LR is one of the problem that falls un- consists of negative (or only positive) instances, and no un-
der this term. LR is also frequently applied in ML scenarios labeled examples are provided. This is known as one-class
[45], where a threshold can be applied in order to transform classification [81], and data of this nature can be obtained
an obtained ranking into a labelset. from outlier detection applications, where positive examples
are hardly recorded.
– LDL. Data with relative importance of each label appears A problem which may be seen as a generalization of
in applications such as analysis of gene expression levels in one-class classification is zero-shot learning [86], a situation
yeast [35], or emotion description from facial expressions where unseen classes are to be predicted in the testing stage.
[76], where a face can depict several emotions in different That is, the label space Y includes some values which are
grades. not present in any training pattern, but the classifier must be
able to predict them. For example, if in a speech recognition
problem Y is the set of all words in English, the training set
– MTR. Applications modeled as MTR problems are di- is unlikely to have at least one instance for each word, thus
verse, including modeling of vegetation condition in ecosys- the classifier will only succeed if it is capable of assigning
tems assigning several scores which depend on the vegeta- unlearned words to test examples.
tion type [63], prediction of audio spectrums of wind tunnel A relaxation on the obstacles of zero-shot learning is
tests [69], and estimation of several biophysical parameters present in one-shot learning [39], where algorithms attempt
from remote sensing images [109]. to generalize from very few (1 to 5) examples of each class.
This is a common circumstance in the field of image classi-
– OR. The most salient fields where OR can be found are fication, where the cost of collecting and labeling data sam-
text classification [5], where the predicted variable may be ples is high.
an opinion scale or a degree of satisfaction; image catego- A classification of these problems according to the type
rization [107]; medical research [7]; credit rating [70], and of missing information can be found in Table 3.
age estimation [15].

Trait Problem types


– MC. Monotonicity constraints are found in problems re-
Presence of unlabeled instances Semi-supervised [16]
lated to customer satisfaction analysis [50], in which overall Positive-unlabeled [37]
appreciation of a product must increase along with the eval-
No representation of some classes One-class [81]
uation of its features; house pricing [89]; bankruptcy risk Positive-unlabeled [37]
evaluation [49], and cancer prediction [92], among others. Zero-shot [86]
Scarce representation of some classes One-shot [39]

Table 3 Partial information problems according to the kind of absence


6 Other nonstandard variations
in the training set.

This section covers variations of the standard supervised


problem which are further from the central focus of this pa-
per less related to those above.
6.2 Prediction of structured data

6.1 Learning with partial information The nonstandard variations described in this work general-
ize traditional supervised problems where the predicted out-
In a standard supervised classification setting, it is assumed put is at most a vector whose components take values in ei-
that every training example is labeled accordingly and that ther a finite set or R. Further generalizations are possible if
there exist examples for every class that may appear in the other kinds of structures are allowed. For example, the target
testing phase. When only a fraction of the training instances may take the form of an ordered sequence or a tree. In this
are labeled, the problem is considered semi-supervised [16], case, the problem usually enters the scope of structured pre-
but generally there still exist labeled samples for each class. diction [104], a generalization of supervised learning where
In positive-unlabeled learning [37, 74], however, labeled methods must build structured data associated to input in-
examples provided within the training set are only positive. stances.
This means the learning algorithm only knows about the A particular case of supervised problem which can be
class of positive instances, and unlabeled ones can have ei- seen under the umbrella of structured prediction is learn-
ther class. ing to rank [13], which does not involve a label space as
10 David Charte et al.

such. Instead, training consists in learning from a set of fea- 3. Amores, J.: Multiple instance classification: Review, taxonomy
ture vectors with a series of preferences among them, that and comparative study. Artificial Intelligence 201, 81 – 105
(2013). DOI https://doi.org/10.1016/j.artint.2013.06.003
is, a partial or total order in the training set. During testing 4. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector ma-
a set of feature vectors is provided and the desired output chines for multiple-instance learning. In: Advances in neural in-
is a ranking (with a predefined number of relevance lev- formation processing systems, pp. 577–584 (2003)
els, allowing ties) or a sorting (simply an ordering of the 5. Baccianella, S., Esuli, A., Sebastiani, F.: Feature selection for
ordinal text classification. Neural computation 26(3), 557–591
instances). This problem differs from OR in that individ- (2014). DOI 10.1162/NECO\ a\ 00558
ual classifications are usually meaningless: only relative dis- 6. Barlow, R.E.: Statistical inference under order restrictions; the
tances among ranked instances matter. theory and application of isotonic regression. Wiley (1972)
7. Bender, R., Grouven, U.: Ordinal logistic regression in medical
research. Journal of the Royal College of physicians of London
31(5), 546–551 (1997)
7 Conclusions 8. Bielza, C., Li, G., Larranaga, P.: Multi-dimensional classification
with bayesian networks. International Journal of Approximate
Traditional supervised learning comprises two well known Reasoning 52(6), 705–727 (2011)
9. Błaszczyński, J., Słowiński, R., Szelag, M.: Sequential covering
problems in machine learning: classification and regression. rule induction algorithm for variable consistency rough set ap-
However, the multitude of applications which do not strictly proaches. Information Sciences 181(5), 987–1002 (2011). DOI
fit the structure of the standard versions of those problems 10.1016/j.ins.2010.10.030
have favored the development of alternative versions which 10. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos,
A.: Feature Selection for High-Dimensional Data.
are more flexible and allow the analysis of more complex Springer International Publishing, Cham (2015). DOI
situations. 10.1007/978-3-319-21858-8. URL https://doi.org/10.
In this work an overview of nonstandard variations of 1007/978-3-319-21858-8
supervised learning problems has been presented. A novel 11. Borchani, H., Varando, G., Bielza, C., Larrañaga, P.: A survey on
multi-output regression. Wiley Interdisciplinary Reviews: Data
taxonomy under several criteria has described relationships Mining and Knowledge Discovery 5(5), 216–233 (2015). DOI
among these variations, where the main differentiating prop- 10.1002/widm.1157
erties are multiplicity of inputs, multiplicity of outputs, pres- 12. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-
ence of order relations and constraints, and partial informa- label scene classification. Pattern Recognition 37(9), 1757–1771
(2004). DOI 10.1016/j.patcog.2004.03.009
tion. Afterwards, common methods for tackling these prob- 13. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M.,
lems have been outlined and their main applications have Hamilton, N., Hullender, G.: Learning to rank using gradient
been mentioned as well. Finally, some additional variants descent. In: Proceedings of the 22nd international conference
which were left out of the scope of the previous analysis on Machine learning, pp. 89–96. ACM (2005). DOI 10.1145/
1102351.1102363
have been introduced as well. 14. Cardoso, J.S., Sousa, R.: Classification models with global con-
Design of novel algorithms for nonstandard supervised straints for ordinal data. In: 2010 Ninth International Conference
tasks is scarcer than adaptations and transformations, but on Machine Learning and Applications, pp. 71–77. IEEE (2010).
DOI 10.1109/ICMLA.2010.18
there exist some approximations and even more open pos-
15. Chang, K.Y., Chen, C.S., Hung, Y.P.: Ordinal hyperplanes ranker
sibilities for tackling these from classical algorithmic per- with cost sensitivities for age estimation. In: Computer vision
spectives, such as probabilistic and heuristic methods, infor- and pattern recognition (cvpr), 2011 ieee conference on, pp. 585–
mation theory and linear algebra, among others. 592. IEEE (2011). DOI 10.1109/CVPR.2011.5995437
16. Chapelle, O., Schlkopf, B., Zien, A.: Semi-Supervised Learning,
1st edn. The MIT Press (2010)
Acknowledgements D. Charte is supported by the Spanish Ministry 17. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Quinta: A
of Science, Innovation and Universities under the FPU National Pro- question tagging assistant to improve the answering ratio in elec-
gram (Ref. FPU17/04069). This work has been partially supported by tronic forums. In: EUROCON 2015 - International Conference
projects TIN2017-89517-P (FEDER Founds) of the Spanish Ministry on Computer as a Tool (EUROCON), IEEE, pp. 1–6 (2015).
of Economy and Competitiveness and TIN2015-68454-R of the Span- DOI 10.1109/EUROCON.2015.7313677
ish Ministry of Science, Innovation and Universities. 18. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Dealing with
difficult minority labels in imbalanced mutilabel data sets. Neu-
rocomputing (2017). DOI 10.1016/j.neucom.2016.08.158
19. Chaudhuri, K., Kakade, S.M., Livescu, K., Sridharan, K.: Multi-
References view clustering via canonical correlation analysis. In: Proceed-
ings of the 26th annual international conference on machine
1. Alvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for vector- learning, pp. 129–136. ACM (2009). DOI 10.1145/1553374.
valued functions: A review. In: Foundations and Trends in 1553391
Machine Learning. Now Publishers (2012). DOI 10.1561/ 20. Chen, Q., Sun, S.: Hierarchical multi-view fisher discriminant
2200000036 analysis. In: International Conference on Neural Informa-
2. Amini, M., Usunier, N., Goutte, C.: Learning from multiple par- tion Processing, pp. 289–298. Springer (2009). DOI 10.1007/
tially observed views-an application to multilingual text catego- 978-3-642-10684-2\ 32
rization. In: Advances in neural information processing systems, 21. Cheng, J., Wang, Z., Pollastri, G.: A neural network approach
pp. 28–36 (2009) to ordinal regression. In: Neural Networks, 2008. IJCNN
A snapshot on nonstandard supervised learning problems 11

2008.(IEEE World Congress on Computational Intelligence). Advances in neural information processing systems, pp. 355–362
IEEE International Joint Conference on, pp. 1279–1284. IEEE (2006)
(2008). DOI 10.1109/IJCNN.2008.4633963 39. Fe-Fei, L., et al.: A bayesian approach to unsupervised one-shot
22. Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Graded multi- learning of object categories. In: Computer Vision, 2003. Pro-
label classification: The ordinal case. In: Proceedings of the ceedings. Ninth IEEE International Conference on, pp. 1134–
27th international conference on machine learning (ICML-10), 1141. IEEE (2003). DOI 10.1109/ICCV.2003.1238476
pp. 223–230 (2010) 40. Fernández, A., Garcı́a, S., Galar, M., Prati, R.C., Krawczyk, B.,
23. Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regres- Herrera, F.: Learning from Imbalanced Data Sets. Springer Inter-
sion. Journal of machine learning research 6(Jul), 1019–1041 national Publishing (2018). DOI 10.1007/978-3-319-98074-4
(2005) 41. Foulds, J., Frank, E.: A review of multi-instance learning as-
24. Clare, A., King, R.D.: Knowledge discovery in multi-label phe- sumptions. The Knowledge Engineering Review 25(1), 1–25
notype data. In: European Conference on Principles of Data Min- (2010). DOI 10.1017/S026988890999035X
ing and Knowledge Discovery, pp. 42–53. Springer (2001). DOI 42. Frank, E., Hall, M.: A simple approach to ordinal classification.
10.1007/3-540-44794-6\ 4 In: European Conference on Machine Learning, pp. 145–156.
25. Costa, M.: Probabilistic interpretation of feedforward network Springer (2001). DOI 10.1007/3-540-44795-4\ 13
outputs, with relationships to statistical prediction of ordinal 43. Fukunaga, K.: Introduction to statistical pattern recognition. El-
quantities. International journal of neural systems 7(05), 627– sevier (2013)
637 (1996). DOI 10.1142/S0129065796000610 44. Fürnkranz, J., Hüllermeier, E., Mencı́a, E.L., Brinker, K.: Multil-
26. De Waal, P.R., Van Der Gaag, L.C.: Inference and learning in abel classification via calibrated label ranking. Machine learning
multi-dimensional bayesian network classifiers. In: European 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
Conference on Symbolic and Quantitative Approaches to Rea- 45. Fürnkranz, J., Hüllermeier, E., Mencı́a, E.L., Brinker, K.: Multil-
soning and Uncertainty, pp. 501–511. Springer (2007). DOI abel classification via calibrated label ranking. Machine learning
10.1007/978-3-540-75256-1\ 45 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
27. De’Ath, G.: Multivariate regression trees: a new technique for 46. Gama, J.: Knowledge discovery from data streams. Chapman
modeling species–environment relationships. Ecology 83(4), and Hall/CRC (2010)
1105–1117 (2002). DOI 10.1890/0012-9658(2002)083[1105: 47. Geng, X.: Label distribution learning. IEEE Transactions on
MRTANT]2.0.CO;2 Knowledge and Data Engineering 28(7), 1734–1748 (2016).
DOI 10.1109/TKDE.2016.2545658
28. Dekel, O., Singer, Y., Manning, C.D.: Log-linear models for label
48. Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM
ranking. In: Advances in neural information processing systems,
Computing Surveys (CSUR) 47(3), 52 (2015). DOI 10.1145/
pp. 497–504 (2004)
2716262
29. Dembczyński, K., Kotłowski, W., Słowiński, R.: Ensemble of
49. Greco, S., Matarazzo, B., Slowinski, R.: A new rough set ap-
decision rules for ordinal classification with monotonicity con-
proach to evaluation of bankruptcy risk. In: Operational tools in
straints. In: International Conference on Rough Sets and Knowl-
the management of financial risks, pp. 121–136. Springer (1998).
edge Technology, pp. 260–267. Springer (2008). DOI 10.1007/
DOI 10.1007/978-1-4615-5495-0\ 8
978-3-540-79721-0\ 38
50. Greco, S., Matarazzo, B., Słowiński, R.: Rough set approach
30. Deng, W.Y., Zheng, Q.H., Lian, S., Chen, L., Wang, X.: Ordinal
to customer satisfaction analysis. In: International Conference
extreme learning machine. Neurocomputing 74(1-3), 447–456
on Rough Sets and Current Trends in Computing, pp. 284–295.
(2010). DOI 10.1016/j.neucom.2010.08.022
Springer (2006). DOI 10.1007/11908029\ 31
31. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the 51. Gutiérrez, P.A., Garcı́a, S.: Current prospects on ordinal and
multiple instance problem with axis-parallel rectangles. Ar- monotonic classification. Progress in Artificial Intelligence
tificial intelligence 89(1-2), 31–71 (1997). DOI 10.1016/ 5(3), 171–179 (2016). DOI 10.1007/s13748-016-0088-y. URL
S0004-3702(96)00034-3 https://doi.org/10.1007/s13748-016-0088-y
32. Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I.: Protein 52. Gutiérrez, P.A., Pérez-Ortiz, M., Sánchez-Monedero, J.,
classification with multiple algorithms. In: Proc. 10th Panhel- Fernández-Navarro, F., Hervás-Martı́nez, C.: Ordinal regression
lenic Conference on Informatics, Volos, Greece, PCI05, pp. 448– methods: Survey and experimental study. IEEE Transactions
456 (2005). DOI 10.1007/11573036\ 42 on Knowledge and Data Engineering 28(1), 127–146 (2016).
33. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John DOI 10.1109/TKDE.2015.2457911
Wiley & Sons (2012) 53. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification for
34. Duivesteijn, W., Feelders, A.: Nearest neighbour classification multiclass classification and ranking. In: Advances in neural in-
with monotonicity constraints. In: Joint European Conference on formation processing systems, pp. 809–816 (2003)
Machine Learning and Knowledge Discovery in Databases, pp. 54. Hernández-González, J., Inza, I., Lozano, J.A.: Weak supervi-
301–316. Springer (2008). DOI 10.1007/978-3-540-87479-9\ sion and other non-standard classification problems: A taxon-
38 omy. Pattern Recognition Letters 69, 49 – 55 (2016). DOI
35. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster 10.1016/j.patrec.2015.10.008
analysis and display of genome-wide expression patterns. Pro- 55. Herrera, F., Charte, F., Rivera, A.J., Del Jesus, M.J.: Multilabel
ceedings of the National Academy of Sciences 95(25), 14863– classification. Springer (2016)
14868 (1998) 56. Herrera, F., Ventura, S., Bello, R., Cornelis, C., Zafra, A.,
36. Elisseeff, A., Weston, J.: A kernel method for multi-labelled clas- Sánchez-Tarragó, D., Vluymans, S.: Multiple instance learning:
sification. In: Advances in neural information processing sys- foundations and algorithms. Springer (2016). DOI 10.1007/
tems, pp. 681–687 (2002) 978-3-319-47759-6
37. Elkan, C., Noto, K.: Learning classifiers from only positive and 57. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label
unlabeled data. In: Proceedings of the 14th ACM SIGKDD in- ranking by learning pairwise preferences. Artificial Intelligence
ternational conference on Knowledge discovery and data mining, 172(16-17), 1897–1916 (2008). DOI 10.1016/j.artint.2008.08.
pp. 213–220. ACM (2008). DOI 10.1145/1401890.1401920 002
38. Farquhar, J., Hardoon, D., Meng, H., Shawe-taylor, J.S., Szed- 58. Hyndman, R.J., Athanasopoulos, G.: Forecasting: principles and
mak, S.: Two view learning: Svm-2k, theory and practice. In: practice. OTexts (2018)
12 David Charte et al.

59. Izenman, A.J.: Reduced-rank regression for the multivariate lin- 77. Maron, O., Lozano-Pérez, T.: A framework for multiple-instance
ear model. Journal of multivariate analysis 5(2), 248–264 (1975). learning. In: Advances in neural information processing systems,
DOI 10.1016/0047-259X(75)90042-1 pp. 570–576 (1998)
60. Jain, A.K., Duin, R.P., Mao, J.: Statistical pattern recognition: 78. Marsland, S.: Machine Learning: An Algorithmic Perspective.
A review. IEEE Transactions on pattern analysis and machine Chapman & Hall (2014)
intelligence 22(1), 4–37 (2000) 79. Micchelli, C.A., Pontil, M.: On learning vector-valued func-
61. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction tions. Neural computation 17(1), 177–204 (2005). DOI 10.1162/
to Statistical Learning: with Applications in R. Springer New 0899766052530802
York, New York, NY (2013). DOI 10.1007/978-1-4614-7138-7 80. Mitchell, T.M.: Machine learning. McGraw Hill series in com-
62. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classifi- puter science. McGraw-Hill (1997)
cation for automated tag suggestion. In: Proc. ECML PKDD08 81. Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier
Discovery Challenge, Antwerp, Belgium, pp. 75–83 (2008) networks for target recognition applications. NASA STI/Recon
63. Kocev, D., Džeroski, S., White, M.D., Newell, G.R., Griffioen, Technical Report N 93 (1993)
P.: Using single-and multi-target regression trees and ensembles 82. Moyano, J.M., Gibaja, E.L., Cios, K.J., Ventura, S.: Review of
to model a compound index of vegetation condition. Ecological ensembles of multi-label classifiers: Models, experimental study
Modelling 220(8), 1159–1168 (2009). DOI 10.1016/j.ecolmodel. and prospects. Information Fusion 44, 33–45 (2018). DOI 10.
2009.01.037 1016/j.inffus.2017.12.001
64. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for 83. Murphy, K.P.: Machine Learning: A Probabilistic Perspective.
predicting structured outputs. Pattern Recognition 46(3), 817– The MIT Press (2012)
833 (2013). DOI 10.1016/j.patcog.2012.09.023 84. Nguyen, C.T., Wang, X., Liu, J., Zhou, Z.H.: Labeling compli-
65. Kotlowski, W., Slowinski, R.: On nonparametric ordinal classi- cated objects: Multi-view multi-instance multi-label learning. In:
fication with monotonicity constraints. IEEE Transactions on AAAI, pp. 2013–2019 (2014)
Knowledge and Data Engineering 25(11), 2576–2589 (2013). 85. Nilsson, N.J.: Learning machines: foundations of trainable
DOI 10.1109/TKDE.2012.204 pattern-classifying systems. McGraw-Hill (1965)
66. Kotsiantis, S., Kanellopoulos, D., Tampakas, V.: Financial appli- 86. Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-
cation of multi-instance learning: two greek case studies. Journal shot learning with semantic output codes. In: Advances in neural
of Convergence Information Technology 5(8), 42–53 (2010) information processing systems, pp. 1410–1418 (2009)
67. Krawczyk, B.: Learning from imbalanced data: open challenges 87. Pan, F.: Multi-dimensional fragment classification in biomedical
and future directions. Progress in Artificial Intelligence 5(4), text. Queen’s University (2006)
221–232 (2016). DOI 10.1007/s13748-016-0094-0. URL 88. Pan, S.J., Kwok, J.T., Yang, Q., Pan, J.J.: Adaptive localization
https://doi.org/10.1007/s13748-016-0094-0 in a dynamic wifi environment through multi-view learning. In:
68. Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spec- AAAI, pp. 1108–1113 (2007)
tral clustering. In: Advances in neural information processing 89. Potharst, R., Feelders, A.J.: Classification trees for problems with
systems, pp. 1413–1421 (2011) monotonicity constraints. ACM SIGKDD Explorations Newslet-
69. Kuznar, D., Mozina, M., Bratko, I.: Curve prediction with kernel ter 4(1), 1–10 (2002). DOI 10.1145/568574.568577
regression. In: Proceedings of the 1st Workshop on Learning 90. Ramon, J., De Raedt, L.: Multi instance neural networks. In:
from Multi-Label Data, pp. 61–68 (2009) Proceedings of the ICML-2000 workshop on attribute-value and
70. Kwon, Y.S., Han, I., Lee, K.C.: Ordinal pairwise partitioning relational learning, pp. 53–60 (2000)
(opp) approach to neural networks training in bond rating. In- 91. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains
telligent Systems in Accounting, Finance & Management 6(1), for multi-label classification. Machine learning 85(3), 333
23–40 (1997). DOI 10.1002/(SICI)1099-1174(199703)6:1h23:: (2011). DOI 10.1007/s10994-011-5256-5
AID-ISAF113i3.0.CO;2-4 92. Ryu, Y.U., Chandrasekaran, R., Jacob, V.S.: Breast cancer pre-
71. Laghmari, K., Marsala, C., Ramdani, M.: An adapted incremen- diction using the isotonic separation technique. European Jour-
tal graded multi-label classification model for recommendation nal of Operational Research 181(2), 842–854 (2007). DOI
systems. Progress in Artificial Intelligence 7(1), 15–29 (2018). 10.1016/j.ejor.2006.06.031
DOI 10.1007/s13748-017-0133-5 93. Sánchez-Fernández, M., de Prado-Cumplido, M., Arenas-Garcı́a,
72. Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.: Sta- J., Pérez-Cruz, F.: Svm multiregression for nonlinear channel es-
tistical learning of multi-view face detection. In: European Con- timation in multiple-input multiple-output systems. IEEE trans-
ference on Computer Vision, pp. 67–81. Springer (2002). DOI actions on signal processing 52(8), 2298–2307 (2004). DOI
10.1007/3-540-47979-1\ 5 10.1109/TSP.2004.831028
73. Lin, H.T., Li, L.: Combining ordinal preferences by boosting. 94. Sánchez-Monedero, J., Gutiérrez, P.A., Hervás-Martı́nez, C.:
In: Proceedings ECML/PKDD 2009 Workshop on Preference Evolutionary ordinal extreme learning machine. In: International
Learning, pp. 69–83 (2009) Conference on Hybrid Artificial Intelligence Systems, pp. 500–
74. Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S.: Building text clas- 509. Springer (2013). DOI 10.1007/978-3-642-40846-5\ 50
sifiers using positive and unlabeled examples. In: Data Mining, 95. Shalev-Shwartz, S., Singer, Y.: A unified algorithmic approach
2003. ICDM 2003. Third IEEE International Conference on, pp. for efficient online label ranking. In: Artificial Intelligence and
179–186. IEEE (2003). DOI 10.1109/ICDM.2003.1250918 Statistics, pp. 452–459 (2007)
75. López-Cruz, P.L., Bielza, C., Larrañaga, P.: Learning conditional 96. Shatkay, H., Pan, F., Rzhetsky, A., Wilbur, W.J.: Multi-
linear gaussian classifiers with probabilistic class labels. In: Con- dimensional classification of biomedical text: Toward automated,
ference of the Spanish Association for Artificial Intelligence, pp. practical provision of high-utility text to diverse users. Bioinfor-
139–148. Springer (2013). DOI 10.1007/978-3-642-40643-0\ matics 24(18), 2086–2093 (2008). DOI 10.1093/bioinformatics/
15 btn381
76. Lyons, M., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial 97. Sill, J.: Monotonic networks. In: Advances in neural information
expressions with gabor wavelets. In: Automatic Face and Ges- processing systems, pp. 661–667 (1998)
ture Recognition, 1998. Proceedings. Third IEEE International 98. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., De Car-
Conference on, pp. 200–205. IEEE (1998). DOI 10.1109/AFGR. valho, A.C., Gama, J.: Data stream clustering: A survey. ACM
1998.670949 Computing Surveys (CSUR) 46(1), 13 (2013)
A snapshot on nonstandard supervised learning problems 13

99. Smola, A.J., Schölkopf, B.: On a kernel-based method for pattern 117. Zhang, M.L., Zhou, Z.H.: Ml-knn: A lazy learning approach
recognition, regression, approximation, and operator inversion. to multi-label learning. Pattern recognition 40(7), 2038–2048
Algorithmica 22(1-2), 211–231 (1998) (2007). DOI 10.1016/j.patcog.2006.12.019
100. Sousa, R., Gama, J.: Multi-label classification from high-speed 118. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning al-
data streams with adaptive model rules and random rules. gorithms. IEEE transactions on knowledge and data engineering
Progress in Artificial Intelligence 7(3), 177–187 (2018). DOI 26(8), 1819–1837 (2014). DOI 10.1109/TKDE.2013.39
10.1007/s13748-018-0142-z 119. Zhang, W., Liu, X., Ding, Y., Shi, D.: Multi-output ls-svr ma-
101. Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, chine in extended feature space. In: Computational Intelligence
I.: Multi-label classification methods for multi-target regression. for Measurement Systems and Applications (CIMSA), 2012
arXiv preprint arXiv 1211 (2012) IEEE International Conference on, pp. 130–134. IEEE (2012).
102. Sun, S., Chao, G.: Multi-view maximum entropy discrimination. DOI 10.1109/CIMSA.2012.6269600
In: IJCAI, pp. 1706–1712 (2013) 120. Zhao, J., Xie, X., Xu, X., Sun, S.: Multi-view learning overview:
103. Surdeanu, M., Tibshirani, J., Nallapati, R., Manning, C.D.: Recent progress and new challenges. Information Fusion 38, 43–
Multi-instance multi-label learning for relation extraction. In: 54 (2017). DOI 10.1016/j.inffus.2017.02.007
Proceedings of the 2012 joint conference on empirical meth- 121. Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treat-
ods in natural language processing and computational natural ing instances as non-iid samples. In: Proceedings of the 26th
language learning, pp. 455–465. Association for Computational annual international conference on machine learning, pp. 1249–
Linguistics (2012) 1256. ACM (2009). DOI 10.1145/1553374.1553534
104. Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning 122. Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance
structured prediction models: A large margin approach. In: Pro- multi-label learning. Artificial Intelligence 176(1), 2291–2320
ceedings of the 22nd international conference on Machine learn- (2012). DOI 10.1016/j.artint.2011.10.002
ing, pp. 896–903. ACM (2005). DOI 10.1145/1102351.1102464
105. Tax, D.M., Duin, R.P.: Using two-class classifiers for multiclass
classification. In: Pattern Recognition, 2002. Proceedings. 16th
International Conference on, vol. 2, pp. 124–127. IEEE (2002)
106. Thabtah, F.A., Cowling, P., Peng, Y.: Mmac: A new multi-class,
multi-label associative classification approach. In: Data Mining,
2004. ICDM’04. Fourth IEEE International Conference on, pp.
217–224. IEEE (2004). DOI 10.1109/ICDM.2004.10117
107. Tian, Q., Chen, S., Tan, X.: Comparative study among three
strategies of incorporating spatial structures to ordinal image re-
gression. Neurocomputing 136, 152–161 (2014). DOI 10.1016/
j.neucom.2014.01.017
108. Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble
method for multilabel classification. In: European conference on
machine learning, pp. 406–417. Springer (2007). DOI 10.1007/
978-3-540-74958-5\ 38
109. Tuia, D., Verrelst, J., Alonso, L., Pérez-Cruz, F., Camps-Valls,
G.: Multioutput support vector regression for remote sensing
biophysical parameter estimation. IEEE Geoscience and Re-
mote Sensing Letters 8(4), 804–808 (2011). DOI 10.1109/LGRS.
2011.2109934
110. Tzortzis, G., Likas, A.: Kernel-based weighted multi-view clus-
tering. In: Data Mining (ICDM), 2012 IEEE 12th International
Conference on, pp. 675–684. IEEE (2012). DOI 10.1109/ICDM.
2012.43
111. Van Der Merwe, A., Zidek, J.: Multivariate regression analysis
and canonical variates. Canadian Journal of Statistics 8(1), 27–
39 (1980). DOI 10.2307/3314667
112. Vazquez, E., Walter, E.: Multi-output support vector regression.
In: 13th IFAC Symposium on System Identification, pp. 1820–
1825. Citeseer (2003)
113. Vembu, S., Gärtner, T.: Label ranking algorithms: A survey. In:
Preference learning, pp. 45–64. Springer (2010). DOI 10.1007/
978-3-642-14125-6\ 3
114. Wang, J., Zucker, J.D.: Solving multiple-instance problem: a lazy
learning approach. In: International Conference on Machine
Learning, pp. 1119–1126. Morgan Kaufmann Publishers (2000)
115. Williams, C.K., Barber, D.: Bayesian classification with gaussian
processes. IEEE Transactions on Pattern Analysis and Machine
Intelligence 20(12), 1342–1351 (1998)
116. Wu, B., Zhong, E., Horner, A., Yang, Q.: Music emotion recogni-
tion by multi-label multi-layer multi-instance multi-view learn-
ing. In: Proceedings of the 22nd ACM international confer-
ence on Multimedia, pp. 117–126. ACM (2014). DOI 10.1145/
2647868.2654904
Heterogeneous Uncertainty Sampling for Supervised Learning

David D. Lewis and Jason Catlett


AT&T Bell Laboratories
Murray Hill, NJ 07974
lewis@research.att.com, catlett@research.att.com
Appeared (with same pagination) in William W. Cohen and Haym Hirsh, eds.,
Machine Learning: Proceedings of the Eleventh International Conference,
Morgan Kaufmann Publishers, San Francisco, CA, pp. 148–156.

Abstract The type of classifier used in uncertainty sampling must


be cheap to build and to use. At each iteration a new
Uncertainty sampling methods iteratively request classifier is built (fortunately from a small sample) and then
class labels for training instances whose classes applied (unfortunately to a large sample). Our uncertainty
are uncertain despite the previous labeled in- sampling method also requires an estimate of the certainty
stances. These methods can greatly reduce the of classifications (a class-probability value) [28]; not all
number of instances that an expert need label. induction systems provide this.
One problem with this approach is that the clas-
This paper examines a heterogeneous approach in which a
sifier best suited for an application may be too
classifier of one type selects instances for training a classi-
expensive to train or use during the selection of
fier of another type. It is motivated by applications requir-
instances. We test the use of one classifier (a
ing a type of classifier that would be too computationally
highly efficient probabilistic one) to select exam-
expensive to use to select instances. Section 2 reviews re-
ples for training another (the C4.5 rule induction
search on uncertainty sampling. Section 3 points out that
program). Despite being chosen by this heteroge-
the class frequencies in uncertainty samples are severely
neous approach, the uncertainty samples yielded
distorted; the training algorithm should accept some pa-
classifiers with lower error rates than random
rameter to correct for this. The experiments described in
samples ten times larger.
Section 6, on a large text categorization data set, showed our
method for this correction to be effective and robust with
1 Introduction respect to the particular parameter value used. Uncertainty
samples chosen by a probabilistic classifier were found to
be significantly better than random samples ten times larger
Machine learning algorithms have been used to build clas-
when used by a modification of Quinlan’s C4.5 algorithm.
sification rules from data sets consisting of hundreds of
Section 9 lists several opportunites for future work.
thousands of instances [4]. In some applications unlabeled
training instances are abundant but the cost of labeling an
instance with its class is high. In the information retrieval
application described here the class labels are assigned by 2 Background
a human, but they could also be assigned by a computer
simulation [2] or a combination of both [30]. The terms Theoretical analysis and practical experience have shown
oracle and teacher have been used for the source of labels; that a classifier can often be built from fewer instances if the
we will usually call it the expert. learning algorithm is allowed to create artificial instances
or membership queries that are given to an expert to label
Where one of the constraints on the induction process is a
[1, 25]. Unfortunately such queries may create nonsensical
limit on the number of instances presented to the oracle, the
examples: is a pregnant non-smoking male at high risk for
choice of instances becomes important. Random sampling
heart disease? In applications where instances are images
[5] may be ineffective if one class is very rare: all of the
or natural language texts, arbitrary membership queries are
training instances presented may have the majority class.
also implausible.
To make more effective use of the expert’s time, methods
that we collectively call uncertainty sampling label data sets Several algorithms have been proposed that base querying
incrementally, alternating between two phases: presenting on filtering a stream of unlabeled instances rather than on
the expert a few instances to label, and selecting (from creating artificial instances [6, 10, 20, 31]. The expert is
a finite or infinite source) instances whose labels are still asked to label only those instances whose class membership
uncertain despite the indications contained in previously is sufficiently uncertain. Several definitions of uncertainty
labeled data. and sufficiency have been used, but all are based on esti-
1. Obtain an initial classifier 3 Heterogeneous Uncertainty Sampling
2. While expert is willing to label instances
(a) Apply the current classifier to each unlabeled in- Uncertainty sampling requires the construction of large
stance numbers (perhaps thousands) of classifiers which are ap-
plied to very large numbers of examples. This suggests that
(b) Find the instances for which the classifier is
the kind of classifier “in the loop” during sampling should
least certain of class membership
be very cheap both to build and run.
(c) Have the expert label the subsample of instances
(d) Train a new classifier on all labeled instances Unfortunately, an uncertainty sample has strong connec-
tions to the classifier form used to select it: despite con-
taining a disproportionately large number of instances from
Figure 1: An algorithm for uncertainty sampling from a low frequency classes, it still yields an accurate classifier.
finite training set using a single classifier. Some of the characteristics of a sample that cause this to
happen for one form of classifier are likely to have the same
effect on others, such as overrepresentation of instances
where different attribute values suggest different classes.
However, this effect is unlikely to be perfect for classifiers
of any form but the one used in selection. A new classifier
trained on the uncertainty sample will then be unduly biased
mating how likely a classifier consistent with the previously toward predicting low frequency classes.
labeled data would be to produce the correct class label for a
given unlabeled instance. These approaches can be viewed Some mechanism to counterbalance this effect is needed.
as a combination of stratified and sequential approaches to A feature of the CART [3] software for decision trees for
sampling [5, 32], so we refer to them as uncertainty sam- specifying priors on classes could be used, but our appli-
pling. cation required decision rules. We used a version of C4.5
modified by Catlett to accept a parameter specifying the
A simple form of uncertainty sampling is possible for clas- relative cost of two types of error: false positives and false
sifiers that operate by testing a numeric score against a negatives [9, Chapter 1]. We call this number the loss ratio
threshold. A single classifier is trained, and those instances (  ). A loss ratio of 1 indicates that the two errors have
whose scores are closest to that classifier’s threshold are equal costs (the original assumption of C4.5). A loss ratio
good candidates to present to the expert. Where the set of greater than 1 indicates that false positive errors (which a
instances is finite, the single instance with a score closest to classifier built from a training set enriched with positive
the threshold can be found; where the stream of instances is instances is more likely to make) are more costly than false
effectively infinite, one can choose instances whose scores negative errors (where a positive instance is classified nega-
are within some distance of the threshold. The cycle is tive). Setting the loss ratio above 1 will counterbalance the
described in Figure 1 for the finite case. overrepresentation of positive instances, but exactly what
Single classifier approaches to uncertainty sampling have figure should be used? This question motivates a sensitivity
analysis of the effect of this parameter on the accuracy of
been criticized [6, 20] on the grounds that one classifier is
classifiers produced.
not representative of the set of all classifiers consistent with
the labeled data: the version space [24]. The degree to The modifications to C4.5 left the selection criterion un-
which this is a problem in practice has not been established. changed (in contrast to CART’s treatment). When building
trees the original C4.5 checks after each split that this de-
Single classifier approaches have successfully been used
creases the error rate; otherwise it replaces the split with
in generating arbitrary queries [16] and in sampling from
a leaf (build.c, line 347); if not disabled this preempts
labeled data [8, 25]. Uncertainty sampling with a single
the construction of rules for classes with few examples.
classifier can also be viewed as a variation on the heuris-
The class values at the leaves are determined not by major-
tic of training on misclassified instances [15, 33, 35]. A
ity vote but by comparison with a probability threshold of
familiar example of this is windowing, which appeared in
Quinlan’s first paper on ID3 [26], was questioned in [36]
 
1 (or its reciprocal as appropriate). Pruning
minimizes expected loss instead of estimated errors (sim-
ply multipling by  , with the usual correction). A similar
and re-examined in Chapter 6 of the C4.5 book [27]. As
with uncertainty sampling, windowing builds a sequence of
change is made to the minimum error rate required to drop
classifiers, selecting instances to add to the training set at
a rule. The choice of default class is also based on ex-
each iteration. The key difference is its assumption that the
pected loss, but the estimates of the number of examples
class labels of all training instances are known: it examines
left uncovered by any rule appeared too low, so an arbitrary
them in order to choose misclassified examples to add.
factor was introduced to counterbalance this. The most
A large scale test of uncertainty sampling with a single problematic question is how to adapt the sifting of the rules
classifier approach [18] showed that uncertainty sampling for each class, which in C4.5 is guided by MDL principle.
could reduce by a factor of up to 500 the amount of data The current implementation simply multiplies the coding
that had to be labeled to achieve a given level of accuracy. cost of either the false positives or the false negatives by
149
 or 1  (as appropriate) to increase the coding cost of Training Test
rulesets that make the more expensive error. Although this Category Number Percent Number Percent
step lacks a theoretical justification, performance appeared tickertalk 208 0.07 40 0.08
satisfactory. boxoffice 314 0.10 42 0.08
4 Task and Data Set

The applications motivating this research fall under the heading of text categorization: the classification of instances composed partly or fully of natural language text into pre-specified categories [7, 19]. We have found several business applications where categorizing text would aid its use, routing, or analysis.

These texts often reside in large databases supporting boolean queries [29, pages 231–236], a restricted version of propositional logic. Because decision rules [27, 34] can be converted into this form (unlike probabilistic models requiring arithmetic), they make a good choice for the final classifier. Another important advantage is that they can be more comprehensible to humans than decision trees [4]. Our databases contain hundreds of thousands of unlabeled instances, so uncertainty sampling is a natural approach. However, as discussed in Section 5, our current decision rule induction software cannot practicably be used for uncertainty sampling from large text databases. We therefore decided to test a heterogeneous approach to uncertainty sampling.

Given that a key aim of the research is to reduce the time spent by human experts categorizing texts, we could hardly ask them to label a hundred thousand instances for the sake of our experiments. Instead we used a data set with similar properties to those in our applications: the titles of stories categorized by a news agency. We collected in electronic form the titles of 371,454 articles that appeared on the AP newswire between 1988 and early 1993. We divided these randomly into a training set of 319,463 titles and a test set of 51,991 titles.1

1 Stories were randomly allocated to the test set with probability 0.14. Our goal was a test set with at least 40 to 50 positive instances of each category.

Titles were converted to lower case and punctuation was removed. Each distinct word was treated as a binary attribute, resulting in 67,331 attributes. The data matrix was therefore extremely sparse, with each instance having an average of 8.9 nonzero attribute values.

The AP data is labeled with several types of subject categories. We defined ten binary categories of AP titles based on the "keyword slug line" from the article [13, page 317]. Frequency information on these categories is given in Table 1. The categories were chosen to resemble the applications of interest to us: approximately one in five hundred instances are positive; the classes are somewhat noisy, and cannot be perfectly determined from the text.

                  Training            Test
  Category     Number  Percent    Number  Percent
  tickertalk      208    0.07        40     0.08
  boxoffice       314    0.10        42     0.08
  bonds           470    0.15        60     0.12
  nielsens        511    0.16        87     0.17
  burma           510    0.16        93     0.18
  dukakis         642    0.20       107     0.21
  ireland         780    0.24       117     0.23
  quayle          786    0.25       133     0.26
  budget         1176    0.37       197     0.38
  hostages       1560    0.49       228     0.44

Table 1: The ten categories used in our experiments, with the number and percentage of positive occurrences on training and test sets.

5 Training C4.5 with Text Data

Although we used a modification of Quinlan's C4.5 software [27] to produce decision rules from the training data, using it to select examples is impractical for large text databases because it requires that training and test instances be presented as tuples specifying the values of all attributes. With 319,463 training instances and 67,331 attributes this would have required over 40 gigabytes. The extravagance of expanding such sparse data was stressed in [22]. The C4.5 algorithm could be implemented in a manner suited to sparse data, but almost no machine learning software has this feature. Even eliminating attributes that take on the value True less than five times in the training data would still have left 24,052 attributes, at the cost of eliminating some useful attributes. Feature selection methods requiring class labels are not a solution because most labels are unknown.

5.1 Uncertainty Sampling with a Probabilistic Classifier

Methods for efficient training of probabilistic classifiers from large, sparse data sets are widely used in information retrieval [14]. We used this type of classifier to select instances in uncertainty sampling. The model is described in detail elsewhere [18], but in brief it gives the following estimate for the probability that an instance belongs to class C:

    P(C | w) = exp( a + b Σ_{j=1..d} wj log [ P(wj | C) / P(wj | C̄) ] )
               / ( 1 + exp( a + b Σ_{j=1..d} wj log [ P(wj | C) / P(wj | C̄) ] ) )        (1)

C indicates class membership, and wj is the jth of d attribute values in the vector w for an instance; a and b are parameters set by the logistic regression described below. The instance is assigned to class C if P(C | w) exceeds 0.5.
The intuition behind the model is that

    Σ_{j=1..d} wj log [ P(wj | C) / P(wj | C̄) ]        (2)

is a plausible approximation (exact if certain independence assumptions and class priors hold) to the likelihood ratio

    P(C | w) / P(C̄ | w)        (3)

and so is a good predictor of class membership. However, it must be scaled to provide an explicit estimate of P(C | w). One approach to this scaling is logistic regression [23].

Training proceeds as follows. The values P(wj | C) and P(wj | C̄), as well as P(C), are estimated for every word wj. This estimation uses a sparse representation of the data and requires only a few seconds for several hundred thousand training instances. Those wj's with large values of P(wj | C) log [ P(wj | C) / P(wj | C̄) ] were selected as features, a strategy found useful in other text classification problems [11]. The value of expression (2) is then computed for each training instance, and the training data is used again to set a and b by a logistic regression.2

2 Alternatively, one could view this as a one-node neural net with the input weights set via a probabilistic model rather than by error propagation.

A classifier of this sort was trained on each iteration of uncertainty sampling, and was then applied to all unlabeled training instances. The two instances with the estimated P(C | w)'s closest to, but above, 0.5 were selected, as well as the two instances with P(C | w)'s closest to, but below, 0.5. Using a subsample size of four rather than one was a compromise for efficiency. Selecting examples both above and below 0.5 was a simple way to halve the potential number of duplicate examples, and may also have benefits for training [16].

5.2 Initial Classifier

Without an initial classifier our sampling algorithm would commence with a long period of nearly random sampling before finding any examples of a low frequency class. Obtaining a plausible initial classifier is usually easy—it would be surprising if an expert were able to classify instances but could not suggest either some positive and negative instances, or some attribute values correlated with the class. For our experiments we instead generated initial classifiers from three positive instances of the category, selected randomly to avoid experimenter bias.

5.3 Feature Selection

The cost of specifying the values of all 67,331 attributes for even a small training set is so large that some feature selection was needed before presenting any data to C4.5. We used the union of the following sets of words as attributes:

1. all words occurring in at least 0.2% of the instances,
2. all words occurring in two or more positive instances, and
3. all words occurring in one or more of the three initial positive instances.

6 Experiment Design

Our experiment tested whether heterogeneous uncertainty sampling would produce decision rules with significantly lower error rates than those trained on random samples of the same or even larger size. We also wanted to determine the sensitivity of the rules' accuracy to the loss ratio used with C4.5. Sources of variability included the categories, quality of starting classifiers, and the vagaries of random sampling.

We repeated the uncertainty sampling process 100 times, 10 trials on each of 10 binary categories, each with a different random set of three initial positive instances. On each run, the three initial instances were used to build an initial classifier, after which uncertainty sampling with a subsample size of four was run for 249 iterations. This yielded 100 groups of 250 uncertainty samples of various sizes. We trained C4.5 rules on two uncertainty samples from each run, one with 299 instances and the final one (999 instances). Values of 1 to 20 for the loss ratio (that is, the ratio of loss incurred for false positives to loss incurred for false negatives) were tested.

As a comparison, C4.5 was applied to samples of size 1,000 and 10,000 produced by adding random instances to the same sets of three starting positive instances used to initialize uncertainty sampling. (The starting positive instances are retained to make the comparison more fair to random sampling.) The samples of size 10,000 were produced by adding additional random examples to the samples of size 1,000. We refer to all these samples as "random", though they are not completely random. Most of the analyses below use the samples of size 10,000.

We were also interested in the difference in accuracy compared to using C4.5 in the instance selection loop. Although it was not practicable to test this directly, we did train probabilistic classifiers on both the uncertainty and random samples to provide some comparison.

7 Results

Figure 2 shows average error rates for C4.5 rules trained with uncertainty samples of size 299 and 999 and various loss ratios, for each of nine categories. (The tenth category, tickertalk, resulted in degenerate classifiers—all instances classified as category nonmembers—under almost all conditions.) In all cases, error rates for uncertainty samples of size 999 are close to or better than those for a random sample of 10,000 instances, provided a loss ratio of three or more is used.
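As a concrete illustration of the selection rule in Section 5.1, the following minimal Python sketch scores unlabeled titles with the logistic model of equation (1) and picks the subsample of four instances whose estimates straddle 0.5. The function names, the toy word weights, and the values of a and b are ours for illustration; they are not taken from the experiments reported here.

    import math

    def class_probability(words, log_ratio, a, b):
        # Estimate P(C | w) as in equation (1): a logistic function of the
        # summed per-word log likelihood ratios of the selected feature words.
        score = sum(log_ratio.get(w, 0.0) for w in words)
        return 1.0 / (1.0 + math.exp(-(a + b * score)))

    def select_subsample(estimates, k=2):
        # Pick the k unlabeled instances with P(C | w) closest to, but above, 0.5
        # and the k closest to, but below, 0.5 (k = 2 gives the subsample of four).
        above = sorted((p - 0.5, i) for i, p in estimates if p > 0.5)
        below = sorted((0.5 - p, i) for i, p in estimates if p <= 0.5)
        return [i for _, i in above[:k]] + [i for _, i in below[:k]]

    if __name__ == "__main__":
        log_ratio = {"hostages": 2.3, "beirut": 1.7, "rates": -0.4}   # illustrative weights
        titles = {0: {"hostages", "freed", "in", "beirut"},
                  1: {"budget", "talks", "stall"},
                  2: {"rates", "edge", "higher"},
                  3: {"hostages", "report"}}
        estimates = [(i, class_probability(ws, log_ratio, a=-2.0, b=1.0))
                     for i, ws in titles.items()]
        print(select_subsample(estimates))   # indices of the titles to send to the expert

In a full run this selection step would be repeated for 249 iterations, with the expert labeling the selected instances before the probabilistic classifier is retrained.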
[Figure 2 consists of nine panels, one per category—boxoffice (0.10%), bonds (0.15%), nielsens (0.16%), burma (0.16%), dukakis (0.20%), ireland (0.24%), quayle (0.25%), budget (0.37%) and hostages (0.49%)—each plotting error rate (per cent) and number of errors against loss ratios of 1, 2, 3, 5, 10 and 20.]

Figure 2: Average error rate for C4.5 rules trained on uncertainty samples of size 299 (black dots) and 999 (white dots), at various loss ratio values. The average error rates for C4.5 rules trained with random samples of size 1,000 (large dashes) and 10,000 (small dashes) are shown as dashed lines. The percentage of positive instances on the training set follows the category name; triangles indicate the percentage on the test set.
                           3 + 996 uncertainty                   3 + 9997 random
              Reject   C4.5 (LR = 5)    prob. (LR = 1)    C4.5 (LR = 1)    prob. (LR = 1)
  Category    All      Average  SD      Average  SD       Average  SD      Average  SD
  tickertalk  0.077    0.077 (0.000)    0.078 (0.001)     0.078 (0.003)    0.109 (0.044)
  boxoffice   0.081    0.047 (0.002)    0.048 (0.008)     0.061 (0.018)    0.077 (0.021)
  bonds       0.115    0.064 (0.002)    0.069 (0.006)     0.076 (0.020)    0.145 (0.069)
  nielsens    0.167    0.094 (0.011)    0.062 (0.005)     0.107 (0.006)    0.100 (0.026)
  burma       0.179    0.090 (0.008)    0.098 (0.006)     0.115 (0.040)    0.193 (0.046)
  dukakis     0.206    0.197 (0.014)    0.208 (0.020)     0.210 (0.039)    0.235 (0.036)
  ireland     0.225    0.188 (0.005)    0.189 (0.011)     0.220 (0.024)    0.228 (0.016)
  quayle      0.256    0.161 (0.009)    0.222 (0.012)     0.143 (0.010)    0.263 (0.035)
  budget      0.379    0.336 (0.010)    0.361 (0.009)     0.350 (0.014)    0.392 (0.016)
  hostages    0.439    0.415 (0.024)    0.360 (0.016)     0.466 (0.039)    0.431 (0.018)

Table 2: Average and standard deviation of percentage error of various classifiers (LR denotes the loss ratio). Reject all is a classifier that deems all instances non-members of the category. Two types of training set were used: an uncertainty sample of size 999 and a random sample of size 10,000. Two types of classifier are built from each training set: a decision rule classifier trained using C4.5, and the probabilistic classifier described in the text. When C4.5 was used on the uncertainty sample, a loss ratio of 5 was used; for the random sample a loss ratio of 1 was used (original C4.5). Figures are averages over 20 runs for classifiers built from random samples using the probabilistic method, and over 10 runs for the other three combinations.

                            3 + 996 uncertainty                3 + 9997 random
              Reject All   C4.5 (LR = 5)   prob. (LR = 1)   C4.5 (LR = 1)   prob. (LR = 1)
  Category    FP    FN     FP     FN       FP     FN        FP     FN       FP     FN
  tickertalk  0.0   40.0   0.0    40.0     1.3    39.3      0.8    39.7     18.3   38.5
  boxoffice   0.0   42.0   5.5    19.0     12.6   12.6      5.0    26.8     10.8   29.3
  bonds       0.0   60.0   3.6    29.8     7.9    28.3      4.7    34.9     33.6   41.9
  nielsens    0.0   87.0   6.0    42.8     9.9    22.2      11.5   44.0     10.6   41.4
  burma       0.0   93.0   3.0    43.9     6.0    44.8      5.0    54.6     14.1   86.6
  dukakis     0.0   107.0  14.4   88.0     9.5    98.5      68.8   40.1     21.0   101.1
  ireland     0.0   117.0  4.8    93.1     16.2   81.9      12.4   101.8    13.8   104.7
  quayle      0.0   133.0  23.3   60.2     19.0   96.6      42.3   32.1     17.2   119.4
  budget      0.0   197.0  10.6   164.2    29.0   158.5     57.1   124.7    25.7   177.9
  hostages    0.0   228.0  30.1   185.6    44.7   142.6     78.3   164.3    25.3   199.0

Table 3: Average number of false positives (FP) and false negatives (FN) for each of 10 categories and 5 conditions. Experiment conditions are the same as for Table 2.
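The loss-ratio settings in Tables 2 and 3 correspond to cost-sensitive class assignment at the leaves, as described in the discussion of the C4.5 modifications above. The sketch below derives a leaf-labelling threshold from a loss ratio by the standard expected-loss argument; it is our illustration of the idea, not necessarily the exact correction applied inside the modified C4.5.

    def predict_positive(p_positive, loss_ratio):
        # loss_ratio = loss for a false positive / loss for a false negative.
        # Choose the class with the smaller expected loss rather than by majority.
        expected_loss_if_positive = (1.0 - p_positive) * loss_ratio   # risk: false positive
        expected_loss_if_negative = p_positive                        # risk: false negative
        return expected_loss_if_positive < expected_loss_if_negative
        # Equivalently: predict positive iff p_positive > loss_ratio / (loss_ratio + 1).

    if __name__ == "__main__":
        print(predict_positive(0.80, loss_ratio=5))   # False: 0.80 < 5/6
        print(predict_positive(0.90, loss_ratio=5))   # True:  0.90 > 5/6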
Table 2 lists error rates for both C4.5 and the probabilistic classifier used during uncertainty sampling. C4.5 figures are for a loss ratio of 5 for uncertainty samples and 1 (the unmodified C4.5) for random samples. The probabilistic classifier uses a loss ratio of 1.0 in both cases. Table 3 shows how the errors divide into false positives and false negatives.

8 Discussion

As Figure 2 shows, an uncertainty sample of 999 instances was in most situations as good for training C4.5 rules as a random sample of 1,000 or even 10,000 instances. At a loss ratio of 5, it was even significantly better (p=.03) than a random sample of 10,000 instances.3 In some cases, an uncertainty sample of 299 instances is also as good, though this was less reliable. As expected, it is often necessary to use a loss ratio greater than 1 in training rules. Fortunately, there is some leeway in choosing the loss ratio—good error rates are produced for values from 3 to 20 (the highest value we tried) for our data. These results show that heterogeneous uncertainty sampling can indeed be effective. Table 2 presents the data for the larger uncertainty samples and random samples in tabular form.

3 Significance by t-score. The null hypothesis was that differences in average error rate across the 10 runs for each category were normally distributed with mean zero and a category-specific variance.

To point out the extremely low category frequencies, Figure 2 and Table 2 also indicate the error rate of a strategy that classifies all instances as nonmembers. While such a strategy has a low error rate, it is not useful. In most cases the classifiers did manage to beat this error rate, and an evaluation measure that penalized false negatives would show an even greater advantage for the trained classifiers.

Table 2 also shows error rates for the probabilistic classifier, both on the samples it selected and on random samples of size 10,000. C4.5 is significantly better (p=.01) than the probabilistic classifier on the random sample, but only insignificantly better (p=.30) on the uncertainty sample. This suggests that C4.5 is actually more suitable for this text categorization task than the probabilistic classifier, and that there is some penalty in accuracy for heterogeneity in uncertainty sampling.

Table 3 is similar to Table 2 but shows false positives and false negatives separately. This shows that while the total numbers of errors produced by our classifiers were sometimes not substantially smaller than the total number for a strategy that rejects all instances, the errors were more balanced between false positives and false negatives.

9 Future Work

In this section we discuss a few unexplored directions in what we believe is a rich area for study.

We believe uncertainty sampling and other sequential, active, or exploratory approaches to learning [12, 25] enable both learning research and learning applications on large, complex, real-world data sets where fixed training sets are impracticable. Natural language processing, where there is great interest in inducing knowledge to support tagging, parsing, semantic interpretation, and other forms of analysis, is a particularly fruitful application area.

Heterogeneous approaches are likely to become common, in response to both resource limitations and the desire to train new algorithms on previously generated uncertainty samples. A better understanding of how to minimize the problems caused by a heterogeneous approach would be desirable.

Note that we treated our large but finite set of instances as if it were infinite. By adapting results from sequential sampling [32] it may be possible both to improve uncertainty sampling and to tell when additional iterations are no longer providing any benefit—when all the juice has been squeezed out of a data set.

Finally, in contrast to the assumptions made in most theoretical work on querying, our categories are stochastic rather than deterministic. A classifier may indicate that the probability of category membership is 0.5 not because the classifier is incompletely trained, but because the expert may really classify such instances as category members 50% of the time. Indeed, we have seen some evidence of such instances being selected in the later iterations of an uncertainty sampling run.

These murky instances are not the best ones for training [17, 20]. This suggests a goal of producing a classifier that estimates P(C | w) accurately rather than simply classifying accurately. The variance of this estimate becomes important, and it may be more appropriate to treat the problem as one of regression or interpolation [21, 25] rather than classification.

10 Summary

Using partially formed classifiers to select training data incrementally can reduce the number of instances the expert must label to achieve a given error rate. Our experiments show that some reduction is possible even if this uncertainty sampling is heterogeneous: the classifiers used to select instances were of a very different type from the one built from the final sample. The decision rules C4.5 produced from uncertainty samples of roughly 1,000 instances chosen by a probabilistic classifier were significantly more accurate than those from random samples ten times larger. The ability to use cheap classifiers to select data for training expensive classifiers makes uncertainty sampling even more attractive for a variety of applications where large amounts of unlabeled data are available.
Acknowledgements

We thank William Cohen, Eileen Fitzpatrick, Yoav Freund, William Gale, Trevor Hastie, Doug McIlroy, Robert Schapire, and Sebastian Seung for advice and useful comments on this work, and Ken Church for help with his text processing tools.

References

[1] Dana Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.

[2] I. Bratko, I. Mozetic, and N. Lavrac. KARDIO: a study in deep and qualitative knowledge for expert systems. MIT Press, Cambridge, Massachusetts, 1989.

[3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[4] J. Catlett. Megainduction: a test flight. In Machine Learning: Proceedings of the Eighth International Workshop, pages 596–599, San Mateo, CA, 1991. Morgan Kaufmann.

[5] William G. Cochran. Sampling Techniques. John Wiley & Sons, New York, 3rd edition, 1977.

[6] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with self-directed learning, 1992. To appear in Machine Learning.

[7] Stuart L. Crawford, Robert M. Fung, Lee A. Appelbaum, and Richard M. Tong. Classification trees for information retrieval. In Eighth International Workshop on Machine Learning, pages 245–249, 1991.

[8] Daniel T. Davis and Jenq-Neng Hwang. Attentional focus training by boundary region data selection. In International Joint Conference on Neural Networks, pages I-676 to I-681, Baltimore, MD, June 7–11 1992.

[9] James P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, New York, 1975.

[10] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems 5, San Mateo, CA, 1992. Morgan Kaufmann.

[11] William A. Gale, Kenneth W. Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439, 1993.

[12] B. K. Ghosh. A brief history of sequential analysis. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, chapter 1, pages 1–19. Marcel Dekker, New York, 1991.

[13] Norm Goldstein, editor. The Associated Press Stylebook and Libel Manual. Addison-Wesley, Reading, MA, 1992.

[14] Donna Harman. Ranking algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 363–392. Prentice Hall, Englewood Cliffs, NJ, 1992.

[15] Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516, May 1968. Reprinted in Agrawala, Machine Recognition of Patterns, IEEE Press, New York, 1977.
[16] Jenq-Neng Hwang, Jai J. Choi, Seho Oh, and Robert J. Marks II. Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks, 2(1):131–136, January 1991.

[17] Igor Kononerko, Ivan Bratko, and Esidija Roskar. Experiments in automatic learning of medical diagnostic rules. Technical report, Jozef Stefan Institute, Ljubljana, Slovenia, 1984.

[18] David D. Lewis and William A. Gale. Training text classifiers by uncertainty sampling. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994. To appear.

[19] David D. Lewis and Philip J. Hayes. Editorial. ACM Transactions on Information Systems, Special Issue on Text Categorization, 1994. To appear.

[20] David J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992.

[21] David J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):589–603, 1992.

[22] Michel Manago. Knowledge intensive induction. In Machine Learning: Proceedings of the Sixth International Workshop, pages 151–155, 1989.

[23] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, 2nd edition, 1989.

[24] Tom M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.

[25] Mark Plutowski and Halbert White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2):305–318, March 1993.

[26] J. R. Quinlan. Discovering rules by induction from large collections of examples. In Expert Systems in the Micro-Electronic Age, Edinburgh, UK, 1979. Edinburgh University Press.

[27] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[28] J. R. Quinlan. Decision trees as probabilistic classifiers. In Proceedings of the Fourth International Workshop on Machine Learning, pages 31–37, Irvine, California, 1987.

[29] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

[30] Claude Sammut, Scott Hurst, Dana Kedzier, and Donald Michie. Learning to fly. In Ninth International Workshop on Machine Learning, pages 385–393, 1992.

[31] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 287–294, 1992.

[32] Bikas Kumar Sinha. Sequential methods for finite populations. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, chapter 1, pages 1–19. Marcel Dekker, New York, 1991.

[33] Paul E. Utgoff. Improved training via incremental learning. In Sixth International Workshop on Machine Learning, pages 362–365, 1989.

[34] Sholom M. Weiss, Robert S. Galen, and Prasad V. Tadepalli. Maximizing the predictive value of production rules. Artificial Intelligence, 45(1–2):47–71, September 1990.

[35] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor, The Psychology of Computer Vision, pages 157–209. McGraw-Hill, New York, 1975.

[36] J. Wirth and J. Catlett. Costs and benefits of windowing in ID3. In Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan, 1988. Morgan Kaufmann.
Wasserstein Propagation for Semi-Supervised Learning

Justin Solomon  justin.solomon@stanford.edu
Raif M. Rustamov  rustamov@stanford.edu
Leonidas Guibas  guibas@cs.stanford.edu
Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, California 94305 USA

Adrian Butscher  adrian.butscher@gmail.com
Max Planck Center for Visual Computing and Communication, Campus E1 4, 66123 Saarbrücken, Germany
Abstract

Probability distributions and histograms are natural representations for product ratings, traffic measurements, and other data considered in many machine learning applications. Thus, this paper introduces a technique for graph-based semi-supervised learning of histograms, derived from the theory of optimal transportation. Our method has several properties making it suitable for this application; in particular, its behavior can be characterized by the moments and shapes of the histograms at the labeled nodes. In addition, it can be used for histograms on non-standard domains like circles, revealing a strategy for manifold-valued semi-supervised learning. We also extend this technique to related problems such as smoothing distributions on graph nodes.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Graph-based semi-supervised learning is an effective approach for learning problems involving a limited amount of labeled data (Singh et al., 2008). Methods in this class typically propagate labels from a subset of nodes of a graph to the rest of the nodes. Usually each node is associated with a real number, but in many applications labels are more naturally expressed as histograms or probability distributions. For instance, the traffic density at a given location can be seen as a histogram over the 24-hour cycle; these densities may be known only where a service has cameras installed but need to be propagated to the entire map. Product ratings, climatic measurements, and other data sources exhibit similar structure.

While methods for numerical labels, such as Belkin & Niyogi (2001); Zhu et al. (2003); Belkin et al. (2006); Zhou & Belkin (2011); Ji et al. (2012) (also see the survey by Zhu (2008) and references therein), can be applied bin-by-bin to propagate normalized frequency counts, this strategy does not model interactions between histogram bins. As a result, a fundamental aspect of this type of data is ignored, leading to artifacts even when propagating Gaussian distributions.

Among first works directly addressing semi-supervised learning of probability distributions is Subramanya & Bilmes (2011), which propagates distributions representing class memberships. Their loss function, however, is based on Kullback-Leibler divergence, which cannot capture interactions between histogram bins. Talukdar & Crammer (2009) allow interactions between bins by essentially modifying the underlying graph to its tensor product with a prescribed bin interaction graph; this approach loses probabilistic structure and tends to oversmooth. Similar issues have been encountered in the mathematical literature (McCann, 1997; Agueh & Carlier, 2011) and in vision/graphics applications (Bonneel et al., 2011; Rabin et al., 2012) involving interpolating probability distributions. Their solutions attempt to find weighted barycenters of distributions, which is insufficient for propagating distributions along graphs.

The goal of our work is to provide an efficient and theoretically sound approach to graph-based semi-supervised learning of probability distributions. Our strategy uses the machinery of optimal transportation (Villani, 2003). Inspired by (Solomon et al., 2013), we employ the two-Wasserstein distance between distributions to construct a regularizer measuring the "smoothness" of an assignment of a probability distribution to each graph node. The final assignment is produced by optimizing this energy while fitting the histogram predictions at labeled nodes.

As certainty in the known distributions increases, our technique reduces to the method of label propagation via harmonic functions (Zhu et al., 2003). Also, the moments and other characteristics of the
propagated distributions are well-characterized by those of the labeled nodes at minima of our smoothness energy. Our approach does not restrict the class of the distributions provided at labeled nodes, allowing for bi-modality and other non-Gaussian properties. Finally, we prove that under an appropriate change of variables our objective can be minimized using a fast linear solve.

Overview  We first motivate the problem of propagating distributions along graphs and show why naïve techniques are ineffective (§2). Given this setup, we develop the Wasserstein propagation technique (§3) and discuss its theoretical properties (§3.1). We also show how it can be used to smooth distribution-valued maps from graphs (§3.2) and extend it to more general domains (§4). Finally, after providing algorithmic details (§5) we demonstrate our techniques on both synthetic (§6.1) and real-world (§6.2) data.

2. Preliminaries and Motivation

2.1. Label Propagation on Graphs

We consider a generalization of the problem of label propagation on a graph G = (V, E). Suppose a label function f is known on a subset of vertices V0 ⊆ V, and we wish to extend f to the remainder V \ V0. The classical approach of Zhu et al. (2003) minimizes the Dirichlet energy ED[f] := Σ_{(v,w)∈E} ωe (fv − fw)² over the space of functions taking the prescribed values on V0. Here ωe is the weight associated to the edge e = (v, w). ED is a measure of smoothness; therefore the minimizer matches the prescribed labels with minimal variation in between. Minimizing this quadratic objective is equivalent to solving ∆f = 0 on V \ V0 for an appropriate positive definite Laplacian matrix ∆ (Chung & Yau, 2000). Solutions of this system are well-known to enjoy many regularity properties, making it a sound choice for smooth label propagation.

2.2. Propagating Probability Distributions

Suppose, however, that each vertex in V0 is decorated with a probability distribution rather than a real number. That is, for each v ∈ V0, we are given a probability distribution ρv ∈ Prob(R). Our goal now is to propagate these distributions to the remaining vertices, generating a distribution-valued map ρ : v ∈ V ↦ ρv ∈ Prob(R) associating a probability distribution with every vertex v ∈ V. It must satisfy ρv(x) ≥ 0 for all x ∈ R and ∫_R ρv(x) dx = 1. In §4 we consider the generalized case ρ : V → Prob(Γ) for alternative domains Γ including subsets of Rⁿ; most of the statements we prove about maps into Prob(R) extend naturally to this setting with suitable technical adjustments.

In the applications we consider, such a propagation process should satisfy a number of properties:

• The spread of the propagated distributions should be related to the spread of the prescribed distributions.

• As the prescribed distributions in V0 become peaked (concentrated around the mean), the propagated distributions should become peaked around the values obtained by propagating means of prescribed distributions via label propagation (e.g. Zhu et al. (2003)).

• The computational complexity of distribution propagation should be similar to that of scalar propagation.

Figure 1. Propagating prescribed probability distributions (in red) to interior nodes of a path graph identified with the interval [0, 1]: (a) naive approach; (b) statistical approach; (c) desirable output.

The simplest method for propagating probability distributions is to extend Zhu et al. (2003) naïvely. For each x ∈ R, we can view ρv(x) as a label at v ∈ V and solve the Dirichlet problem ∆ρv(x) = 0 with ρv0(x) prescribed for all v ∈ V0. The resulting functions ρv(x) are distributions because the maximum principle guarantees ρv(x) ≥ 0 for all x and ∫_R ρv(x) dx = 1 for all v ∈ V since these properties hold at the boundary (Chung et al., 2007).

It is easy to see, however, that this method has shortcomings. For instance, consider the case where G is a path graph representing the segment [0, 1] and the labeled vertices are the endpoints, V0 = {0, 1}. In this case, the naïve approach results in the linear interpolation ρt(x) := (1 − t)ρ0(x) + tρ1(x) at all intermediate graph vertices for t ∈ (0, 1). The propagated distributions are thus bimodal as in Figure 1a. Given our criteria, however, we would prefer an interpolation result closer to Figure 1c, which causes the peak in the boundary data simply to slide from left to right without introducing variance as t changes.

An alternative strategy for propagating probability distributions over V given boundary data on V0 is to use a statistical approach. We could repeatedly draw an independent sample from each distribution in {ρv : v ∈ V0} and propagate the resulting scalars using a classical approach; binning the results of these repeated experiments provides a histogram-style distribution at each vertex in V. This strategy has similar shortcomings to the naïve approach above. For instance, in the path graph example, the interpolated distribution is trimodal as in Figure 1b, with nonzero probability at both endpoints and for some v in the interior of V.
Of course, the desiderata above are application-specific. One key assumption is that the spread of the distributions is preserved, which differs from existing approaches which tend to blur the distributions. While this property is not intrinsically superior, in a way the experiments in §6 validate not only the algorithmic effectiveness of our technique but also this assumption about probabilistic data on graphs.

3. Wasserstein Propagation

Ad hoc methods for propagating distributions based on methods for scalar functions tend to have a number of drawbacks. Therefore, we tackle this problem using a technique designed explicitly for the probabilistic setting. To this end, we formulate the semi-supervised problem at hand as the optimization of a Dirichlet energy for distribution-valued maps generalizing the classical Dirichlet energy.

Similar to the construction in (Subramanya & Bilmes, 2011), we replace the square distance between scalar function values appearing in the classical Dirichlet energy (namely the quantity |fv − fw|²) with an appropriate distance between the distributions ρv and ρw. Rather than using the bin-by-bin KL divergence, however, we use the Wasserstein distance with quadratic cost between probability distributions with finite second moment on R. This distance is defined as

    W2(ρv, ρw) := inf_{π∈Π(ρv,ρw)} ( ∬_{R²} |x − y|² dπ(x, y) )^{1/2}

where Π(ρv, ρw) ⊆ Prob(R²) is the set of probability distributions π on R² satisfying the marginal constraints

    ∫ π(x, y) dx = ρw(y)   and   ∫ π(x, y) dy = ρv(x).

The Wasserstein distance is a well-known distance metric for probability distributions, sometimes called the quadratic Earth Mover's Distance, and is studied in the field of optimal transportation. It measures the optimal cost of transporting one distribution to another, given that the cost of transporting a unit amount of mass from x to y is |x − y|². W2(ρv, ρw) takes into account not only the values of ρv and ρw but also the ground distance in the sample space R. It already has shown promise for search and clustering techniques (Irpino et al., 2011; Applegate et al., 2011) and interpolation problems in graphics and vision (Bonneel et al., 2011).

With these ideas in place, we define a Dirichlet energy for a distribution-valued map from a graph into Prob(R) by

    ED[ρ] := Σ_{(v,w)∈E} W2²(ρv, ρw),        (1)

along with the notion of Wasserstein propagation of distribution-valued maps given prescribed boundary data.

WASSERSTEIN PROPAGATION
Minimize ED[ρ] in the space of distribution-valued maps with prescribed distributions at all v ∈ V0.

3.1. Theoretical Properties

Solutions of the Wasserstein propagation problem satisfy many desirable properties that we will establish below. Before proceeding, however, we recall a fact about the Wasserstein distance. Let ρ ∈ Prob(R) be a probability distribution. Then its cumulative distribution function (CDF) is given by F(x) := ∫_{−∞}^{x} ρ(y) dy, and the generalized inverse of its CDF is given by F⁻¹(s) := inf{x ∈ R : F(x) > s}. Then the following result holds.

Proposition 1. [Villani (2003), Theorem 2.18] Let ρ0, ρ1 ∈ Prob(R) with CDFs F0, F1. Then

    W2²(ρ0, ρ1) = ∫₀¹ (F1⁻¹(s) − F0⁻¹(s))² ds.        (2)

By applying (2) to the minimization problem (1), we obtain a linear strategy for our propagation problem.

Proposition 2. Wasserstein propagation can be characterized in the following way. For each v ∈ V0 let Fv be the CDF of the distribution ρv. Now suppose that for each s ∈ [0, 1] we determine gs : V → R as the solution of the classical Dirichlet problem

    ∆gs = 0          ∀ v ∈ V \ V0
    gs(v) = Fv⁻¹(s)   ∀ v ∈ V0.        (3)

Then for each v, the function s ↦ gs(v) is the inverse CDF of a probability distribution ρv. Moreover, the distribution-valued map v ↦ ρv minimizes the Dirichlet energy (1).

Proof. Let X be the set of functions g : V × [0, 1] → R satisfying the constraints gs(v) = Fv⁻¹(s) for all s ∈ [0, 1] and all v ∈ V0. Consider the minimization problem

    min_{g∈X} ÊD(g) := Σ_{(u,v)∈E} ∫₀¹ (gs(u) − gs(v))² ds.

The solution of this optimization for each s is exactly a solution of the classical Dirichlet problem (3) on G. Moreover, the maximum principle implies that gs(v) ≤ gs′(v) whenever s < s′, which holds by definition for all v ∈ V0, can be extended to all v ∈ V (Chung et al., 2007). Hence gs(v) can be interpreted as an inverse CDF for each v ∈ V from which we can define a distribution-valued map ρ : v ↦ ρv. Since ÊD takes on its minimum value in the subset of X consisting of inverse CDFs, and ÊD coincides with ED on this set, ρ is a solution of the Wasserstein propagation problem.
Distribution-valued maps ρ : V → Prob(R) propagated by optimizing (1) satisfy many analogs of functions extended using the classical Dirichlet problem. Two results of this kind concern the mean m(v) and the variance σ(v) of the distributions ρv as functions of V. These are defined as

    m(v) := ∫_{−∞}^{∞} x ρv(x) dx
    σ²(v) := ∫_{−∞}^{∞} (x − m(v))² ρv(x) dx.

Proposition 3. Suppose the distribution-valued map ρ : V → Prob(R) is obtained using Wasserstein propagation. Then for all v ∈ V the following estimates hold.

• inf_{v0∈V0} m(v0) ≤ m(v) ≤ sup_{v0∈V0} m(v0).

• 0 ≤ σ(v) ≤ sup_{v0∈V0} σ(v0).

Proof. Both estimates can be derived from the following formula. Let ρ ∈ Prob(R) and let φ : R → R be any integrable function. If we apply the change of variables s = F(x), where F is the CDF of ρ, in the integral defining the expectation value of φ with respect to ρ, we get

    ∫_{−∞}^{∞} φ(x) ρ(x) dx = ∫₀¹ φ(F⁻¹(s)) ds.

Thus m(v) = ∫₀¹ Fv⁻¹(s) ds and σ²(v) = ∫₀¹ (Fv⁻¹(s) − m(v))² ds, where Fv is the CDF of ρv for each v ∈ V.

Assume ρ minimizes (1) with fixed boundary constraints on V0. By Proposition 2, we then have ∆Fv⁻¹ = 0 for all v ∈ V. Therefore ∆m(v) = ∫₀¹ ∆Fv⁻¹(s) ds = 0, so m is a harmonic function on V. The estimates for m follow by the maximum principle for harmonic functions. Also,

    ∆[σ²(v)] = ∫₀¹ ∆(Fv⁻¹(s) − m(v))² ds
             = Σ_{(v,v′)∈E} ∫₀¹ (a(v, s) − a(v′, s))² ds ≥ 0,  where a(v, s) := Fv⁻¹(s) − m(v),

since ∆Fv⁻¹(s) = ∆m(v) = 0. Thus σ² is a subharmonic function and the upper bound for σ² follows by the maximum principle for subharmonic functions.

Finally, we check that if we encode a classical interpolation problem using Dirac delta distributions, we recover the classical solution. The essence of this result is that if the boundary data for Wasserstein propagation has zero variance, then the solution must also have zero variance.

Proposition 4. Suppose that there exists u : V0 → R such that ρv(x) = δ(x − u(v)) for all v ∈ V0. Then, the solutions of the classical Dirichlet problem and the Wasserstein propagation problem coincide in the following way. Suppose that f : V → R satisfies the classical Dirichlet problem with boundary data u. Then ρv(x) := δ(x − f(v)) minimizes (1) subject to the fixed boundary constraints.

Proof. The boundary data for ρ given here yields the boundary data gs(v) = u(v) for all v ∈ V0 and s ∈ [0, 1) in the Dirichlet problem (3). The solution of this Dirichlet problem is thus also constant in s, let us say gs(v) = f(v) for all s ∈ [0, 1) and v ∈ V. The only distributions whose inverse CDFs are of this form are δ-distributions; hence ρv(x) = δ(x − f(v)) as desired.

3.2. Application to Smoothing

Using the connection to the classical Dirichlet problem in Proposition 2 we can extend our treatment to other differential equations. There is a large space of differential equations that have been adapted to graphs via the discrete Laplacian ∆; here we focus on the heat equation, considered e.g. in Chung et al. (2007).

The heat equation for scalar functions is applied to smoothing problems; for example, in Rⁿ solving the heat equation is equivalent to Gaussian convolution. Just as the Dirichlet equation on F⁻¹ is equivalent to Wasserstein propagation, heat diffusion on F⁻¹ is equivalent to gradient flows of the energy ED in (1), providing a straightforward way to understand and implement such a diffusive process.

Proposition 5. Let ρ : V → Prob(R) be a distribution-valued map and let Fv : [0, 1] → R be the CDF of ρv for each v ∈ V. Then these two procedures are equivalent:

• Mass-preserving flow of ρ in the direction of steepest descent of the Dirichlet energy.

• Heat flow of the inverse CDFs.

Proof. A mass-preserving flow of ρ is a family of distribution-valued maps ρε : V → Prob(R) with ε ∈ (−ε0, ε0) that satisfies the equations

    ∂ρ_{v,ε}(t)/∂ε + ∂/∂t [ Yv(ε, t) ρ_{v,ε}(t) ] = 0    ∀ v ∈ V
    ρ_{v,0}(t) = ρv(t)

where Yv : (−ε0, ε0) × R → R is an arbitrary function that governs the flow. By applying the change of variables t = F_{v,ε}⁻¹(s) using the inverse CDFs of the ρ_{v,ε}, we find that this flow is equivalent to the equations

    ∂F_{v,ε}⁻¹(s)/∂ε = Yv(ε, F_{v,ε}⁻¹(s))    ∀ v ∈ V
    F_{v,0}⁻¹(s) = Fv⁻¹(s).
A short calculation starting from (1) now leads to the derivative of the Dirichlet energy under such a flow, namely

    dED(ρε)/dε = −2 Σ_{v∈V} ∫₀¹ ∆(F_{v,ε}⁻¹) · Yv(ε, F_{v,ε}⁻¹(s)) ds.

Thus, steepest descent for the Dirichlet energy is achieved by choosing Yv(ε, F_{v,ε}⁻¹(s)) := ∆(F_{v,ε}⁻¹(s)) for each v, ε, s. As a result, the equation for the evolution of F_{v,ε}⁻¹ becomes

    ∂F_{v,ε}⁻¹(s)/∂ε = ∆(F_{v,ε}⁻¹(s))    ∀ v ∈ V
    F_{v,0}⁻¹(s) = Fv⁻¹(s)

which is exactly heat flow of F_{v,ε}⁻¹.

4. Generalization

Our preceding discussion involves distribution-valued maps into Prob(R), but in a more general setting we might wish to replace Prob(R) with Prob(Γ) for an alternative domain Γ carrying a distance metric d. Our original formulation of Wasserstein propagation easily handles such an extension by replacing |x − y|² with d(x, y)² in the definition of W2. Furthermore, although proofs in this case are considerably more involved, some key properties proved above for Prob(R) extend naturally.

In this case, we no longer can rely on the computational benefits of Propositions 2 and 5 but can solve the propagation problem directly. If Γ is discrete, then Wasserstein distances between ρv's can be computed using a linear program. Suppose we represent two histograms as {a1, . . . , am} and {b1, . . . , bm} with ai, bi ≥ 0 ∀i and Σi ai = Σi bi = 1. Then, the definition of W2 yields the optimization:

    W2²({ai}, {bj}) = min Σ_{ij} d²_{ij} x_{ij}        (4)
    s.t. Σ_j x_{ij} = ai ∀i,   Σ_i x_{ij} = bj ∀j,   x_{ij} ≥ 0 ∀i, j

Here d_{ij} is the distance from bin i to bin j, which need not be proportional to |i − j|.

From this viewpoint, the energy ED from (1) remains convex in ρ and can be optimized using a linear program simply by summing terms of the form (4) above:

    min_{ρ,x} Σ_{e∈E} Σ_{ij} d²_{ij} x_{ij}^{(e)}
    s.t. Σ_j x_{ij}^{(e)} = ρ_{vi}   ∀ e = (v, w) ∈ E, i ∈ S
         Σ_i x_{ij}^{(e)} = ρ_{wj}   ∀ e = (v, w) ∈ E, j ∈ S
         Σ_i ρ_{vi} = 1 ∀ v ∈ V,   ρ_{vi} fixed ∀ v ∈ V0
         ρ_{vi} ≥ 0 ∀ v ∈ V, i ∈ S,   x_{ij}^{(e)} ≥ 0 ∀ i, j ∈ S

where S = {1, . . . , m}.

5. Algorithm Details

We handle the general case from §4 by optimizing the linear programming formulation directly. Given the size of these linear programs, we use large-scale barrier method solvers.

The characterizations in Propositions 2 and 5, however, suggest a straightforward discretization and accompanying set of optimization algorithms in the linear case. In fact, we can recover propagated distributions by inverting the graph Laplacian ∆ via a sparse linear solve, leading to near-real-time results for moderately-sized graphs G.

For a given graph G = (V, E) and subset V0 ⊆ V, we discretize the domain [0, 1] of Fv⁻¹ for each v using a set of evenly-spaced samples s0 = 0, s1, . . . , sm = 1. This representation supports any ρv provided it is possible to sample the inverse CDF from Proposition 1 at each si. In particular, when the underlying distributions are histograms, we model ρv using δ functions at evenly-spaced bin centers, which have piecewise constant CDFs; we model continuous ρv using piecewise linear interpolation. Regardless, in the end we obtain a non-decreasing set of samples (F⁻¹)v^1, . . . , (F⁻¹)v^m with (F⁻¹)v^1 = 0 and (F⁻¹)v^m = 1.

Now that we have sampled Fv⁻¹ for each v ∈ V0, we can propagate to the remainder V \ V0. For each i ∈ {1, . . . , m}, we solve the system from (3):

    ∆g = 0            ∀ v ∈ V \ V0
    g(v) = (F⁻¹)v^i   ∀ v ∈ V0.        (5)

In the diffusion case, we replace this system with implicit time stepping for the heat equation, iteratively applying (I − t∆)⁻¹ to g for diffusion time step t. In either case, the linear solve is sparse, symmetric, and positive definite; we apply Cholesky factorization to solve the systems directly.

This process propagates F⁻¹ to the entire graph, yielding samples (F⁻¹)v^i for all v ∈ V. We invert once again to yield samples ρv^i for all v ∈ V. Of course, each inversion incurs some potential for sampling and discretization error, but in practice we are able to oversample sufficiently to overcome most potential issues. When the inputs ρv are discrete histograms, we return to this discrete representation by integrating the resulting ρv ∈ Prob([0, 1]) over the width of the bin about the center defined above.

This algorithm is efficient even on large graphs and is easily parallelizable. For instance, the initial sampling steps for obtaining F⁻¹ from ρ are parallelizable over v ∈ V0, and the linear solve (5) can be parallelized over samples i. Direct solvers can be replaced with iterative solvers for particularly large graphs G; regardless, the structure of such a solve is well-understood and studied, e.g. in Krishnan et al. (2013).
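As a compact illustration of this linear algorithm, the following numpy sketch propagates sampled inverse CDFs by solving the Dirichlet system (5) with the labeled values held fixed. It uses a dense solve and omits the final inversion back to histograms; the function name and toy example are ours.

    import numpy as np

    def wasserstein_propagation(L, labeled, inv_cdf_labeled):
        # L               : (n, n) graph Laplacian
        # labeled         : indices of V0
        # inv_cdf_labeled : (len(labeled), m) samples of F_v^{-1}, nondecreasing rows
        # Returns an (n, m) array of inverse-CDF samples for every vertex.
        n = L.shape[0]
        unlabeled = np.setdiff1d(np.arange(n), labeled)
        g = np.zeros((n, inv_cdf_labeled.shape[1]))
        g[labeled] = inv_cdf_labeled
        # One sparse SPD solve per sample i in the text; a single dense solve
        # handles all m right-hand sides here.
        g[unlabeled] = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                                       -L[np.ix_(unlabeled, labeled)] @ inv_cdf_labeled)
        return g

    if __name__ == "__main__":
        # Path graph on 5 vertices: peaked boundary distributions slide across the graph.
        A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
        L = np.diag(A.sum(1)) - A
        s = np.linspace(0, 1, 11)
        inv_cdf_labeled = np.vstack([np.full_like(s, 0.2),    # delta at 0.2 on vertex 0
                                     np.full_like(s, 0.8)])   # delta at 0.8 on vertex 4
        print(wasserstein_propagation(L, np.array([0, 4]), inv_cdf_labeled).round(2))
        # Each interior row is constant (a delta) whose location moves from 0.2 to 0.8,
        # as Proposition 4 predicts.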
Figure 2. Comparison of propagation strategies on a linear graph (coarse version on left); each horizontal slice represents a vertex v ∈ V, and the colors from left to right in a slice show ρv. (Subramanya & Bilmes, 2011) (KL) is shown only in one example because it has qualitatively similar behavior to the PDF strategy.

6. Experiments

We run our scheme through a number of tests demonstrating its strengths and weaknesses compared to other potential methods for propagation. We compare Wasserstein propagation with the strategy of propagating probability distribution functions (PDFs) directly, as described in §2.2.

6.1. Synthetic Tests

We begin by considering the behavior of our technique on synthetic data designed to illustrate its various properties.

One-Dimensional Examples  Figure 2 shows "displacement interpolation" properties inherited by our propagation technique from the theory of optimal transportation. The underlying graph is a line as in Figure 1, along the vertical axis. Horizontally, each image is colored by values in ρv. The bottom and top vertices v0 and v1 have fixed distributions ρv0 and ρv1, and the remaining vertices receive ρv via one of two propagation techniques. The left of each pair propagates distributions by solving a classical Dirichlet problem independently for each bin of the probability distribution function (PDF) ρv, whereas the right of each pair propagates inverse CDFs using our method in §5.

By examining the propagation behavior from the bottom to the top of this figure, it is easy to see how the naïve PDF method varies from Wasserstein propagation. For instance, in the leftmost example both ρv0 and ρv1 are unimodal, yet when propagating PDFs all the intermediate vertices have bimodal distributions; furthermore, no relationship is determined between the two peaks. Contrastingly, our technique identifies the modes of ρv0 and ρv1, linearly moving the peak from one side to the other.

Figure 3. PDF (b) and Wasserstein (c) propagation on a meshed circle with prescribed boundary distributions (a). The underlying graph is shown in grey, and probability distributions at vertices v ∈ V are shown as vertical bars colored by the density ρv; we invert the color scheme of Figures 2 and 4 to improve contrast. Propagated distributions in (b) and (c) are computed for all vertices but for clarity are shown at representative slices of the circle.

Boundary Value Problems  Figure 3 illustrates our algorithm on a less trivial graph G. To mimic a typical test case for classical Dirichlet problems, our graph is a mesh of the unit circle, and we propagate ρv from fixed distributions on the boundary. Unlike the classical case, however, our prescribed boundary distributions ρv are multimodal. Once again, Wasserstein propagation recovers a smoothly-varying set of distributions whose peaks behave like solutions to the classical Dirichlet problem. Propagating probability densities rather than inverse CDFs yields somewhat similar modes, but with much higher entropy and variance especially at the center of the circle.

Figure 4. Comparison of PDF diffusion (a) and Wasserstein diffusion (b); in both cases the leftmost distribution comprises the initial conditions, and several time steps of diffusion are shown left-to-right. The underlying graph G is the circle on the left.

Diffusion  Figure 4 illustrates the behavior of Wasserstein diffusion compared with simply diffusing distribution values directly. When PDF values are diffused directly, as time t increases the distributions simply become more and more smooth until they are uniform not only along G but also as distributions on Prob([0, 1]). Contrastingly, Wasserstein diffusion preserves the uncertainty from the initial distributions but does not increase it as time progresses.
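A minimal numpy sketch of the diffusion variant follows: implicit heat steps applied to the sampled inverse CDFs, as in §3.2 and §5. Here L = D − A is the positive-semidefinite combinatorial Laplacian, so the step matrix is written (I + tL)⁻¹, which matches (I − t∆)⁻¹ under the opposite sign convention for ∆; the function name and toy example are ours.

    import numpy as np

    def wasserstein_diffusion(L, inv_cdf, t=0.1, steps=5):
        # inv_cdf : (n, m) samples of the inverse CDF at every vertex.
        # Each iteration applies one implicit heat step to every column of samples.
        n = L.shape[0]
        step_matrix = np.linalg.inv(np.eye(n) + t * L)   # dense for brevity; sparse in practice
        g = inv_cdf.copy()
        for _ in range(steps):
            g = step_matrix @ g
        return g

    if __name__ == "__main__":
        A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)     # path graph on 4 vertices
        L = np.diag(A.sum(1)) - A
        inv_cdf = np.tile(np.linspace(0.0, 1.0, 9), (4, 1))      # uniform distribution everywhere
        inv_cdf[0] = 0.5                                         # except a delta at vertex 0
        print(wasserstein_diffusion(L, inv_cdf, t=0.5, steps=3).round(2))
        # The delta at vertex 0 spreads toward its neighbors, but each row remains a valid
        # (nondecreasing) inverse CDF rather than flattening toward a uniform histogram.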
Figure 5. Interpolation of distributions on S¹ via (a) PDF propagation and (b) Wasserstein propagation; in these figures the vertices with valence 1 have prescribed distributions ρv and the remaining vertices have distributions from propagation.

Alternative Target Domain  Figure 5 shows an example in which the target is Prob(S¹), where S¹ is the unit circle, rather than Prob([0, 1]). We optimize the ED using the linear program in §4 rather than the linear algorithm for Prob([0, 1]). Conclusions from this example are similar to those from Figure 3: Wasserstein propagation identifies peaks from different prescribed boundary distributions without introducing variance, while PDF propagation exhibits much higher variance in the interpolated distributions and does not "move" peaks from one location to another.

6.2. Real-World Data

We now evaluate our techniques on real-world input. To evaluate the quality of our approach relative to ground truth, we will use the one-Wasserstein distance, or Earth Mover's Distance (Rubner et al., 2000), formulated by removing the square in the formula for W2². We use this distance, given on Prob(R) by the L1 distance between (non-inverted) CDFs, because it does not favor the W2 distance used in Wasserstein propagation while taking into account the ground distances. We consider weather station coordinates as defining a point cloud on the plane and compute the point cloud Laplacian using the approach of (Coifman & Lafon, 2006).

Temperature Data  Figure 6 illustrates the results of a series of experiments on weather data on a map of the United States.1 Here, we have |V| = 1113 sites each collecting daily temperature measurements, which we classify into 100 bins at each vertex. In each experiment, we choose a subset V0 ⊆ V of vertices, propagate the histograms from these vertices to the remainder of V, and measure the error between the propagated and ground-truth histograms.

1 National Climatic Data Center

Figure 6a shows quantitative results of this experiment. Here we show the average histogram error per vertex as a function of the percent of nodes in V with fixed labels; the fixed vertices are chosen randomly, and errors are averaged over 20 trials for each percentage. The Wasserstein strategy consistently outperforms naïve PDF interpolation with respect to our error metric and approaches relatively small error with as few as 5% of the labels fixed.

Figures 6b and 6c show results for a single trial. We color the vertices v ∈ V by the mean (b) and standard deviation (c) of ρv from PDF and Wasserstein propagation. Both yield similar mean temperatures on V \ V0, which agree with the means of the ground truth data. The standard deviations, however, better illustrate differences between the approaches. In particular, the standard deviations of the Wasserstein-propagated distributions approximately follow those of the ground truth histograms, whereas the PDF strategy yields high standard deviations nearly everywhere on the map due to undesirable smoothing effects.

Wind Directions  We apply the general formulation in §4 to propagating distributions on the unit circle S¹ by considering histograms of wind directions collected over time by nodes on the ocean outside of Australia.2

2 WindSat Remote Sensing Systems

In this experiment, we keep approximately 4% of the data points and propagate to the remaining vertices. Both the PDF and Wasserstein propagation strategies score similarly with respect to our error metric; in the experiment shown, Wasserstein propagation exhibits 6.6% average error per node and PDF propagation exhibits 6.1% average error per node. Propagation results are illustrated in Figure 7a.

The nature of the error from the two strategies, however, is quite different. In particular, Figure 7b shows the same map colored by the entropy of the propagated distributions. PDF propagation exhibits high entropy away from the prescribed vertices, reflecting the fact that the propagated distributions at these points approach uniformity. Wasserstein propagation, on the other hand, has a more similar pattern of entropy to that of the ground truth data, reflecting structure like that demonstrated in Proposition 3.
Figure 6. We propagate histograms of temperatures collected over time to a map of the United States: (a) average error at propagated sites as a function of the number of nodes with labeled distributions; (b) means of the histograms at the propagated sites from a typical trial in (a); (c) standard deviations at the propagated sites. Vertices with prescribed distributions are shown in blue and comprise ∼2% of V.

Figure 7. (a) Interpolating histograms of wind directions using the PDF and Wasserstein propagation methods, illustrated using the same scheme as Figure 5 (columns: ground truth, PDF, Wasserstein); (b) entropy values from the same distributions.

The resulting distributions are peaked about a single maximum, so we extract a direction field as the mode of each ρ_v. Despite noise in the dataset, we achieve 15% error rather than the 19% error obtained by classical Dirichlet interpolation of angles disregarding periodicity.

Figure 8. Learning wind directions on the unit circle S^1 (panels: ground truth; PDF interpolation, 19% error; Wasserstein propagation, 15% error).
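A minimal sketch of the encoding and decoding steps used by this strategy, under our own assumptions (uniform binning of [0, 2π) and a small mass eps spread over all bins to keep the prescribed distributions strictly positive); these helpers are illustrative, not the authors' code:

import numpy as np

def angle_to_histogram(theta, n_bins=100, eps=1e-3):
    # Encode an angle on S^1 as a sharply peaked (near-delta) histogram.
    hist = np.full(n_bins, eps / n_bins)
    hist[int(theta / (2 * np.pi) * n_bins) % n_bins] += 1.0 - eps
    return hist / hist.sum()

def histogram_to_angle(hist):
    # Decode a propagated distribution by extracting its mode (heaviest bin center).
    n_bins = len(hist)
    return (np.argmax(hist) + 0.5) * 2 * np.pi / n_bins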
7. Conclusion

It is easy to formulate strategies for histogram propagation by applying methods for propagating scalar functions bin-by-bin. Here, however, we have shown that propagating inverse CDFs instead has deep connections to the theory of optimal transportation and provides superior results, making it a strong yet still efficient choice. This basic connection gives our method theoretical and practical soundness that is difficult to guarantee otherwise.

While our algorithms show promise as practical techniques, we leave many avenues for future study. Most prominently, the generalization in §4 can be applied to many problems, such as the surface mapping problem in Solomon et al. (2013). Such an optimization, however, has O(m^2 |E|) variables, which is intractable for dense or large graphs. An open theoretical problem might be to reduce the number of variables asymptotically. Some simplifications may also be afforded using approximations like (Pele & Werman, 2009), which simplify the form of d_ij at the cost of complicating theoretical analysis and understanding of optimal distributions ρ_v. Alternatively, work such as (Rabin et al., 2011) suggests the potential to formulate efficient algorithms when replacing Prob([0, 1]) with Prob(S^1) or other domains with special structure.

In the end, our proposed algorithms are as lightweight as less principled alternatives, while exhibiting practical performance, theoretical soundness, and the possibility of extension into several alternative domains.

Acknowledgments The authors gratefully acknowledge the support of NSF grants CCF 1161480 and DMS 1228304, AFOSR grant FA9550-12-1-0372, a Google research award, the Max Planck Center for Visual Computing and Communications, the National Defense Science and Engineering Graduate Fellowship, the Hertz Foundation Fellowship, and the NSF GRF program.

References

Agueh, M. and Carlier, G. Barycenters in the Wasserstein space. J. Math. Anal., 43(2):904–924, 2011.

Applegate, David, Dasu, Tamraparni, Krishnan, Shankar, and Urbanek, Simon. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pp. 636–644, 2011.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pp. 585–591, 2001.

Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399–2434, December 2006.

Bonneel, Nicolas, van de Panne, Michiel, Paris, Sylvain, and Heidrich, Wolfgang. Displacement interpolation using Lagrangian mass transport. Trans. Graph., 30(6):158:1–158:12, December 2011.

Chung, Fan and Yau, S.-T. Discrete Green's functions. J. Combinatorial Theory, 91(1–2):191–214, 2000.

Chung, Soon-Yeong, Chung, Yun-Sung, and Kim, Jong-Ho. Diffusion and elastic equations on networks. Pub. RIMS, 43(3):699–726, 2007.

Coifman, Ronald R. and Lafon, Stéphane. Diffusion maps. Applied and Computational Harmonic Anal., 21(1):5–30, 2006.

Irpino, Antonio, Verde, Rosanna, and de A.T. de Carvalho, Francisco. Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. CoRR, abs/1110.1462, 2011.

Ji, Ming, Yang, Tianbao, Lin, Binbin, Jin, Rong, and Han, Jiawei. A simple algorithm for semi-supervised learning with improved generalization error bound. In ICML, 2012.

Krishnan, Dilip, Fattal, Raanan, and Szeliski, Richard. Efficient preconditioning of Laplacian matrices for computer graphics. Trans. Graph., 32(4):142:1–142:15, July 2013.

McCann, Robert J. A convexity principle for interacting gases. Advances in Math., 128(1):153–179, 1997.

Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, pp. 460–467, 2009.

Rabin, Julien, Delon, Julie, and Gousseau, Yann. Transportation distances on the circle. J. Math. Imaging Vis., 41(1–2):147–167, September 2011.

Rabin, Julien, Peyre, Gabriel, Delon, Julie, and Bernot, Marc. Wasserstein barycenter and its application to texture mixing. Volume 6667 of LNCS, pp. 435–446. Springer, 2012.

Rubner, Yossi, Tomasi, Carlo, and Guibas, Leonidas. The earth mover's distance as a metric for image retrieval. IJCV, 40(2):99–121, November 2000.

Singh, Aarti, Nowak, Robert D., and Zhu, Xiaojin. Unlabeled data: Now it helps, now it doesn't. In NIPS, pp. 1513–1520, 2008.

Solomon, Justin, Guibas, Leonidas, and Butscher, Adrian. Dirichlet energy for analysis and synthesis of soft maps. Comp. Graph. Forum, 32(5):197–206, 2013.

Subramanya, Amarnag and Bilmes, Jeff. Semi-supervised learning with measure propagation. JMLR, 12:3311–3370, 2011.

Talukdar, Partha Pratim and Crammer, Koby. New regularized algorithms for transductive learning. ECML-PKDD, 5782:442–457, 2009.

Villani, Cédric. Topics in Optimal Transportation. Graduate Studies in Mathematics. AMS, 2003.

Zhou, Xueyuan and Belkin, Mikhail. Semi-supervised learning by higher order regularization. ICML, 15:892–900, 2011.

Zhu, Xiaojin. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.

Zhu, Xiaojin, Ghahramani, Zoubin, and Lafferty, John D. Semi-supervised learning using Gaussian fields and harmonic functions. pp. 912–919, 2003.
Supervised learning to detect salt body
Pablo Guillen (University of Houston), German Larrazabal (Repsol USA), Gladys González (Repsol USA)
Dainis Boumber (University of Houston), Ricardo Vilalta (University of Houston)

Summary

In this paper we present a novel approach to detect salt bodies based on seismic attributes and supervised learning. We report on the use of a machine learning algorithm, Extremely Randomized Trees, to automatically identify and classify salt regions. We have worked with a complex synthetic seismic dataset from the phase I model of the SEG Advanced Modeling Corporation (SEAM) that corresponds to deep water regions of the Gulf of Mexico. This dataset has very low frequency and contains sediments bearing amplitude values similar to those of salt bodies. In the first step of our methodology, where machine learning is applied directly to the seismic data, we obtained accuracy values of around 80%. A second (post-processing) smoothing step improved accuracy to around 95%. We conclude that machine learning is a promising mechanism to identify salt bodies on seismic data, especially with models that can produce complex decision boundaries, while being able to control the associated variance component of error.

Introduction

Seismic-data interpretation has as its main goal the identification of compartments, faults, fault sealing, and trapping mechanisms that hold hydrocarbons; it additionally tries to understand the depositional history of the environment to describe the relationship between seismic data and a priori geological information. Data mining or knowledge discovery in databases (KDD) has become a significant area both in academia and industry. Data mining is the process of extracting novel, useful and understandable patterns from a large collection of data. Automated tools for knowledge discovery are frequently invoked in databases to unveil patterns that show how objects group into some classification scheme; algorithms make use of higher order statistics, feature extraction methods, pattern recognition, clustering methods, and unsupervised and supervised classification. A major strategy in this field is to apply data mining algorithms (Hastie, 2011) to classify points or parts of the 3D seismic data to reinforce correct data interpretations. Multiple studies have shown the benefits of using data mining techniques for seismic-data interpretation. For example, previous work has shown how to generate a set of seismic traces from velocity models containing faults with varying locality, using machine learning to identify the presence of a fault in previously unseen traces (Zhang et al., 2014). Other techniques segment a seismic image into structural and stratigraphic geologic units (Hale, 2002), which is best done using global optimization methods (Shi et al., 2000; Hale et al., 2003). Another solution is to use unsupervised learning techniques (Coléou et al., 2003), often relying on the application of Self Organizing Maps (Castro de Matos et al., 2007). Our new approach is essentially a novel salt body detection workflow. The workflow as a whole envisions the creation of a software solution that can automatically identify, classify and delineate salt bodies from seismic data using seismic attributes and supervised learning algorithms. A comparison between the detected salt body and its interpretation from a 3D synthetic dataset testifies to the effectiveness of our approach.

Method

Automated classification of salt bodies using machine learning

Our approach aims at automatically identifying and delineating geological elements from seismic data. Specifically, we focus on the automatic classification of salt bodies using supervised learning techniques. In supervised learning we assume each element of study is represented as an n-component vector-valued random variable (X1, X2, ..., Xn), where each Xi represents an attribute or feature; the space of all possible feature vectors is called the input space X. We also consider a set {w1, w2, ..., wk} corresponding to the possible classes; this forms the output space W. A classifier or learning algorithm typically receives as input a set of training examples from a source domain, T = {(xi, wi)}, where x = (x1, x2, ..., xn) is a vector in the input space and w is a value in the (discrete) output space. We assume the training or source sample T consists of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution, P(x, w), in the input-output space. The outcome of the classifier is a hypothesis or function f(x) mapping the input space to the output space, f: X → W. We commonly choose the hypothesis that minimizes the expected value of a loss function (e.g., the zero-one loss).

The challenge behind classification of seismic data

Our workflow takes as input a cube of seismic data where each voxel stands as a feature vector (we used three informative features, as described below). From the whole cube we take a small fraction of representative voxels to form a training set T = {(xi, wi)}, where x = (x1, x2, x3); we assume only two classes, w1 and w2, corresponding to voxels inside and outside the salt body, respectively.
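As an illustration of the workflow described above, a training set of voxel feature vectors might be assembled as follows; this is our own sketch, and the array names and shapes are assumptions rather than the authors' code:

import numpy as np

def build_training_set(features, labels, fraction=0.005, seed=0):
    # features: (nx, ny, nz, 3) array, one 3-component feature vector per voxel.
    # labels:   (nx, ny, nz) array with w1 = 0 (outside salt) and w2 = 1 (inside salt).
    rng = np.random.default_rng(seed)
    X = features.reshape(-1, 3)
    w = labels.reshape(-1)
    idx = rng.choice(w.size, size=int(fraction * w.size), replace=False)
    return X[idx], w[idx]   # training set T = {(x_i, w_i)}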

This workflow is challenging because: 1) the sheer size of the 3D data cube precludes training predictive models with more than just 1% of the available training data, which implies several regions of the cube may not be fairly represented in the training set; 2) many learning algorithms are unable to cope with millions of training examples, and it took days to complete the entire data processing; and 3) classification is difficult because many voxels inside and outside salt bodies have very similar appearance; in machine learning terminology this is a problem known as high Bayes error. The success of this workflow is clearly contingent on finding useful and informative features to appropriately discriminate among classes.

Informative attributes to generate predictive models in seismic data

A proper characterization of voxels can be attained with useful and informative features. We selected three features for our study exhibiting high correlation with the target class: signal amplitude (directly from the seismic data), second derivative, and curve length; the last two are derived from the amplitude. The second derivative is instrumental to detect edges in images, and the curve length captures patterns that characterize different features observed inside a salt structure and in its surroundings.
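The paper does not spell out how these attributes are computed, so the following NumPy sketch is only a plausible reading: the second derivative via repeated numerical differentiation along a trace, and a windowed curve length as accumulated absolute sample-to-sample variation. The window size and function name are our assumptions:

import numpy as np

def trace_features(amplitude, window=5):
    # amplitude: 1D array of samples along one trace (a column of the cube).
    amplitude = np.asarray(amplitude, dtype=float)
    second_deriv = np.gradient(np.gradient(amplitude))           # edge-like variations
    step = np.abs(np.diff(amplitude, prepend=amplitude[0]))      # local variation
    curve_len = np.convolve(step, np.ones(window), mode="same")  # windowed curve length
    return np.stack([amplitude, second_deriv, curve_len], axis=-1)  # (n_samples, 3)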
Supervised learning algorithms

Our data analysis phase receives as input a body of seismic data with the task of automatically identifying salt regions. We randomly sample a small fraction (0.5%) of the total data; the sample is then assigned class labels by an expert (aided by a software tool that simplifies the labeling process). To achieve a class-balanced problem, we made sure exactly one half of the subset corresponded to salt and the other half to non-salt (the task exhibited equal class priors). The model was built using 2 million training voxels. Accuracy is estimated using 10-fold cross validation (Hastie, 2011). The classification model was subsequently used to automatically label the entire body of seismic data (376,752,501 voxels). Our top performing learning algorithms were the following: Gradient Boosting Trees (accuracy 80%), Extremely Randomized Trees (accuracy 80%), and Random Forests (accuracy 79%). All our learning algorithms are ensemble methods; these techniques have shown remarkable performance due to their ability to attain low bias (using complex decision boundaries) and low variance (achieved by averaging over various models).
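A hedged scikit-learn sketch of the training and evaluation step described above; the estimator settings, the balancing helper, and the arrays X_train, w_train and X_cube (assumed to come from the sampling step sketched earlier) are our choices rather than the authors' exact configuration:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def balance_classes(X, w, seed=0):
    # Keep equal numbers of salt (w == 1) and non-salt (w == 0) voxels.
    rng = np.random.default_rng(seed)
    salt, other = np.flatnonzero(w == 1), np.flatnonzero(w == 0)
    n = min(salt.size, other.size)
    idx = np.concatenate([rng.choice(salt, n, replace=False),
                          rng.choice(other, n, replace=False)])
    return X[idx], w[idx]

# X_train, w_train: sampled voxel features and labels; X_cube: features for every voxel.
X_bal, w_bal = balance_classes(X_train, w_train)         # class-balanced training voxels
clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)   # Extremely Randomized Trees
scores = cross_val_score(clf, X_bal, w_bal, cv=10)        # 10-fold cross validation
clf.fit(X_bal, w_bal)
salt_mask = clf.predict(X_cube).astype(bool)              # Boolean mask over the full cube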
Example

We have tested our proposed technique using SEAM I (SEG Advanced Modeling Corporation) data. This dataset comes from marine acquisition and presents strong challenges to the geophysical community. Its inspiration was a deep water (600–2000 meters) US Gulf of Mexico salt structure, and its major structural features are a salt body with a rugose top and overhangs, twelve radial faults near the salt root, an overturned sediment raft proximate to the salt root, internal sutures, and a heterogeneous salt cap. The migrated seismic volume was obtained with very low frequency, and there are sediment locations with amplitude values similar to those of the salt body. A migrated seismic volume with these kinds of features makes salt body detection very complex. Mathematical and machine learning algorithms were taken from Python's NumPy and Scikit-learn libraries, respectively. Our final predictive model of choice was Extremely Randomized Trees, which was used to predict the labels of 376,752,501 samples; this resulted in a Boolean mask. The accuracy reported was essentially the same as in cross validation (80%). After that, we removed outliers and misclassifications using mathematical morphological operations and a 3D interactive guided (manual intervention) tool developed in house; finally, we used threshold segmentation with a local average threshold to obtain better detection results.

Results

We describe our results by visually comparing our predictions on a cube of seismic data. Figure 1(a) shows a cross section of the seismic data, Figure 1(b) shows the classification obtained with our proposed methodology, and Figure 1(c) shows the classification after the post-processing step.

Figure 1: (a) Seismic data, (b) classification using our method, (c) results obtained with a post-processing step.
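The morphological part of the post-processing mentioned in the Example section might look like the following scipy.ndimage sketch; the specific operations and iteration counts are our assumptions, and the in-house interactive tool and local-average thresholding are not reproduced here:

from scipy import ndimage

def clean_salt_mask(mask, iterations=2):
    # Remove small isolated misclassifications, then fill small holes,
    # using binary opening followed by binary closing on the 3D Boolean mask.
    opened = ndimage.binary_opening(mask, iterations=iterations)
    return ndimage.binary_closing(opened, iterations=iterations)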

Figure 2 shows the overlap between the seismic data and the salt body (white color) detected at different inline locations. We can observe that the seismic attributes, used in combination with the machine learning algorithm, allow capturing and classifying the different patterns and features that distinguish sediments from the salt body.

Figure 2: Overlap between the seismic data and the detected salt body.

To measure accuracy, we count the number of hits between the detected salt body and the interpretation in the following way: using both volumes, we have counted the number of hits voxel by voxel. We refer to this number as NH. The effectiveness ratio is calculated as (NH/TS) * 100, where TS is the total number of voxels in the volume. Following this technique, we have obtained an accuracy of 95.22%.
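A minimal sketch of the effectiveness ratio defined above (the function and array names are ours):

import numpy as np

def effectiveness_ratio(detected, interpretation):
    # NH: voxel-by-voxel hits between the detected salt body and the interpretation;
    # TS: total number of voxels in the volume.
    NH = np.sum(detected == interpretation)
    TS = detected.size
    return 100.0 * NH / TS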

Figure 3 shows a comparison between the detected salt body (white color) and its interpretation (red color). We can see the promising quality of our detection for the synthetic seismic dataset used in this work.

Figure 3: Overlap between the seismic data and (a) the detected salt body, and (b) its interpretation.

Conclusions

We have shown an efficient approach to classify salt bodies from a very complex synthetic seismic dataset using machine learning techniques. Results show very high accuracy when machine learning algorithms are used to predict class labels of voxels on a seismic cube; this is true even after training with a very small portion of the data (0.5%). After a first step, where machine learning is applied directly to the data, we obtained accuracy values of around 80%. A second (post-processing) step increased accuracy to around 95%. We conclude that machine learning is a promising mechanism to identify geological bodies on seismic data when the selected model has high capacity and is able to control the variance component of error by model averaging (using ensemble techniques).

Acknowledgments

The authors thank Repsol for its support and for the authorization to present this work. We would also like to acknowledge the SEG Advanced Modeling Corporation (SEAM) for their initiative in creating the realistic salt model and seismic data used for this study.

References

Castro de Matos, M., Manassi Osorio, P. L., Schroeder, P. R. 2007. Unsupervised Seismic Facies Analysis using Wavelet Transform and Self Organizing Maps. GEOPHYSICS, 72 (1), pp. 9-21.

Coléou, T., Poupon, M., Azbel, K. 2003. Unsupervised Seismic Facies Classification: A Review and Comparison of Techniques and Implementation. The Leading Edge, 22, pp. 942–953.

Hale, D., Emanuel, J. 2003. Seismic Interpretation using Global Image Segmentation. 73rd Annual International Meeting, Society of Exploration Geophysicists.

Hale, D. 2002. Atomic Meshes from Seismic Imaging to Reservoir Simulation. Proceedings of the 8th European Conference on the Mathematics of Oil Recovery, Freiberg, Germany.

Hastie, T., Tibshirani, R., Friedman, J. 2011. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, Springer.

Shi, J., Malik, J. 2000. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (8), pp. 888–905.

Zhang, C., Frogner, C., Araya-Polo, M., Hohl, D. 2014. Machine Learning Based Automated Fault Detection in Seismic Traces. Proceedings of the 76th EAGE Conference and Exhibition, Amsterdam.
