Semi-Supervised Classification of Unlabeled Data (PU Learning)

Alon Agmon · Mar 6, 2020 · 7 min read

Photo by Bruno Martins on Unsplash


How to classify unlabeled data when all you have is just a few positive
samples
Suppose you have a dataset of payment transactions. Some of the transactions are
labeled as fraud and the rest are labeled as authentic, and you are required to design a
model that will distinguish between fraudulent and authentic transactions. Assuming
you have enough data and good features, this seems like a straightforward classification
task. However, suppose that only 15% of your data is labeled, and that the labeled samples all belong to a single class, so your training set consists of 15% of the samples labeled as authentic while the rest are unlabeled and could be either authentic or fraudulent.
How would you go about classifying them? Has this twist in the requirements just
turned this task into an unsupervised learning problem? Well, not necessarily.

This problem, often referred to as a PU (positive and unlabeled) classification problem, should first be distinguished from two similar and common “labeling issues” that complicate many classification tasks. The first and most common type of labeling issue is the problem of a small training set: you have a decent amount of data, but only a small part of it is actually labeled. This problem has many varieties and quite a few dedicated training methodologies. Another common labeling issue (one that is often conflated with PU problems) involves cases in which our training data set is fully labeled but consists of just one class. Suppose, for example, that all we have is a data set of non-fraudulent transactions and that we need to use it to train a model that distinguishes between (similar) non-fraudulent transactions and fraudulent ones. This is also a common problem, usually treated as an unsupervised outlier-detection problem, though quite a few widely available ML tools are specifically designed to handle this scenario (OneClassSVM might be the most famous).
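Just to make that contrast concrete, here is a minimal one-class sketch (this is the alternative scenario, not the PU approach discussed in this post; x_authentic_only and x_new are hypothetical feature matrices):

from sklearn.svm import OneClassSVM

# fit only on transactions we know to be authentic
ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
ocsvm.fit(x_authentic_only)
# +1 = looks like the (authentic) training data, -1 = outlier / suspected fraud
flags = ocsvm.predict(x_new)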

A PU classification problem, in contrast, involves a training set in which only part of the data is labeled as positive, while the rest is unlabeled and could be either positive or negative. For instance, suppose that your employer is a bank that can provide you with a lot of transactional data but can only confirm that part of it is 100% authentic. The example I will use here involves a similar scenario with respect to fraudulent banknotes: a data set of 1,372 banknotes, most of which are unlabeled, while only a small part is confirmed as authentic. Although PU problems are quite common, they are discussed much less often than the two classification problems mentioned earlier, and very few hands-on examples or libraries are widely available.

The purpose of this post is to present one possible approach to PU problems which I have
recently used in a classification project. It is based on the paper “Learning classifiers
from only positive and unlabeled data” (2008) written by Charles Elkan and Keith Noto,
and on some code written by Alexandre Drouin. Although other approaches to PU learning appear in the scientific literature (I intend to discuss another rather popular one in a future post), Elkan and Noto’s (E&N) approach is quite simple and can be easily implemented in Python.

A bit of theory (bear with me please…)

Photo by Antoine Dautry on Unsplash

E&N essentially claim that, given a data set in which we have positive and unlabeled data, the probability that a certain sample is positive, P(y=1|x), equals the probability that the sample is labeled, P(s=1|x), divided by the probability that a positive sample is labeled in our data set, P(s=1|y=1):

P(y=1|x) = P(s=1|x) / P(s=1|y=1)

If this claim is true (and I’m not going to prove or defend it — you can read the proof in
the paper itself and experiment with the code), then it seems relatively easy to
implement. This is so because although we don't have enough labeled data to train a
classifier to tell us whether a sample is positive or negative, in a PU scenario we do have
enough labeled data to tell us whether a positive sample is likely to be labeled or not and,
according to E&N, this is enough to estimate how likely it is to be positive.
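To make the arithmetic concrete, here is a toy illustration with made-up numbers (these values are not taken from any real data set):

# toy numbers, purely illustrative
p_s1_given_x = 0.4    # P(s=1|x): the "is it labeled?" classifier's output for sample x
p_s1_given_y1 = 0.5   # P(s=1|y=1): how often positives are labeled in our data set
p_y1_given_x = p_s1_given_x / p_s1_given_y1
print(p_y1_given_x)   # 0.8 -- the estimated probability that x is positive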

Putting things more formally, given an unlabeled data set with just a group of samples
labeled as positive, we can estimate the probability that unlabeled sample x is positive if
we estimate P(s=1|x) / P(s=1|y=1). Luckily, we can use almost any sklearn-based
classifier to estimate this according to the following steps:
(1) Fit a classifier on a data set containing labeled and unlabeled data, using an isLabeled indicator as the target y. Fitting a classifier in this way trains it to predict the probability that a given sample x is labeled, P(s=1|x).

(2) Use the classifier to predict the probability that each of the known positive samples in our data set is labeled. Since these samples are all positive, each prediction is an estimate of P(s=1|x) for a positive x. Calculate the mean of these predicted probabilities, and that will be our estimate of P(s=1|y=1).

Having estimated P(s=1|y=1), all we need to do in order to predict the probability that data point k is positive according to E&N is to estimate P(s=1|k), the probability that it is labeled, which is exactly what the classifier we trained in step (1) knows how to do.

(3) Use the classifier we trained in step (1) to estimate the probability that k is labeled, P(s=1|k).

(4) Once we have estimated P(s=1|k), we can actually classify k by dividing it by P(s=1|y=1), which was estimated in step (2), and get the actual probability that it belongs to the positive class (and, by complement, to the negative one).

Now let’s code and test this

Steps 1–4 above can be implemented as follows:

# prepare data
x_data = the training set
y_data = target variable (1 for the labeled positives, -1 for the rest)

# fit the classifier and estimate P(s=1|y=1)
classifier, ps1y1 = fit_PU_estimator(x_data, y_data, 0.2, Estimator())

# estimate the probability that each sample is labeled, P(s=1|x)
predicted_s = classifier.predict_proba(x_data)[:, 1]

# estimate the actual probability that each sample is positive
# by calculating P(s=1|x) / P(s=1|y=1)
predicted_y = predicted_s / ps1y1

Let’s start with the main move here: the fit_PU_estimator() method.

The fit_PU_estimator() method completes two main tasks: it fits a classifier of your choice on a sample of the positive and unlabeled training set, and then estimates the probability that a positive sample is labeled. Correspondingly, it returns the fitted classifier (which has learned to estimate the probability that a given sample is labeled) and the estimated probability P(s=1|y=1). After that, all we need to do is find P(s=1|x), the probability that x is labeled. Because that's what our classifier is trained to do, we just need to call its predict_proba() method. Finally, in order to actually classify sample x, we just divide the result by the P(s=1|y=1) we have already found. This can be represented in code as:

import xgboost as xgb

pu_estimator, probs1y1 = fit_PU_estimator(
    x_train,
    y_train,
    0.2,
    xgb.XGBClassifier())

predicted_s = pu_estimator.predict_proba(x_train)
predicted_s = predicted_s[:,1]
predicted_y = predicted_s / probs1y1


The implementation of the fit_PU_estimator() method itself is quite self-explanatory:

import numpy as np

def fit_PU_estimator(X, y, hold_out_ratio, estimator):
    # The training set will be divided into a fitting set that will be used
    # to fit the estimator in order to estimate P(s=1|X), and a held-out set of
    # positive samples that will be used to estimate P(s=1|y=1)
    # --------
    # find the indices of the positive/labeled elements
    assert (type(y) == np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0]
    # hold_out_size = the *number* of positive/labeled samples
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the *indices* of the positive elements
    # that we will later use to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size]
    # the actual positive *elements* that we will keep aside
    X_hold_out = X[hold_out]
    # remove the held-out elements from X and y
    X = np.delete(X, hold_out, 0)
    y = np.delete(y, hold_out)
    # We fit the estimator on the unlabeled samples + (part of) the positive and labeled samples
    # in order to estimate P(s=1|X), i.e. the probability that an element is *labeled*
    estimator.fit(X, y)
    # We then use the estimator to predict on the positive held-out set
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)
    # take the probability that it is 1 (labeled)
    hold_out_predictions = hold_out_predictions[:, 1]
    # save the mean probability
    c = np.mean(hold_out_predictions)
    return estimator, c


def predict_PU_prob(X, estimator, prob_s1y1):
    prob_pred = estimator.predict_proba(X)
    prob_pred = prob_pred[:, 1]
    return prob_pred / prob_s1y1

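Once fit_PU_estimator() has returned, the predict_PU_prob() helper can be used to score new, unseen samples. Here is a minimal usage sketch (not part of the original gist; x_new is an assumed feature matrix with the same columns as the training data, and 0.5 is an arbitrary cut-off):

# score new samples with the fitted PU estimator (hypothetical follow-up)
new_probs = predict_PU_prob(x_new, pu_estimator, probs1y1)
new_labels = (new_probs >= 0.5).astype(int)   # 1 = predicted positive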


In order to test this, I used the Banknote Authentication data set, which is based on 4 features extracted from images of genuine and forged banknotes. I first used the classifier on the fully labeled data set in order to set a baseline, and then kept the labels of only 25% of the positive samples (treating everything else as unlabeled) in order to test how it performs on a PU data set. As the output shows, this data set is admittedly not one of the hardest to classify, but you can see that although the PU classifier only “knew” about 153 positive samples while the remaining 1,219 samples were unlabeled, it performed quite well compared to a classifier that had all the labels available. It did, however, lose about 17 percentage points of recall, and therefore quite a few true positives. Still, when this is all we have, I believe these results are quite satisfying compared to the alternatives.
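For reference, here is a rough sketch of how such a PU training set can be simulated from fully labeled data (the variable names are mine and the exact preprocessing in the notebook may differ):

import numpy as np

# y_true holds the original binary labels as a NumPy array; class 1 is treated as positive
y_pu = np.full(len(y_true), -1)                 # start with everything unlabeled
positives = np.where(y_true == 1)[0]            # indices of the known positives
np.random.shuffle(positives)
keep = positives[:int(0.25 * len(positives))]   # keep the labels of ~25% of the positives
y_pu[keep] = 1                                  # everything else stays unlabeled (-1)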

===>> load data set <<===

data size: (1372, 5)

Target variable (fraud or not):


0 762
1 610

===>> create baseline classification results <<===

Classification results:

f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%

===>> classify on all the data set <<===

Target variable (labeled or not):


-1 1219
1 153

Classification results:

f1: 90.24%
roc: 91.11%
recall: 82.62%
precision: 99.41%
A few important notes. First, the performance of this approach greatly depends on the size of the data set. In this example, I used about 150 positive samples and about 1,200 unlabeled ones, which is far from the ideal data set for this approach. If we only had 100 samples, for example, our classifier would perform very poorly. Second, as the attached notebook shows, there are a few variables to tune (such as the size of the sample to be set aside, the probability threshold to use for classification, etc.), but the most important one is probably the chosen classifier and its parameters. I chose XGBoost because it performs relatively well on small data sets with few features, but it will not be the best choice in every scenario, so it is important to test different classifiers.
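As a quick illustration of that last point, swapping in a different scikit-learn-compatible classifier only requires changing one argument (RandomForestClassifier here is just an example, not a recommendation):

from sklearn.ensemble import RandomForestClassifier

pu_estimator, probs1y1 = fit_PU_estimator(
    x_train,
    y_train,
    0.2,
    RandomForestClassifier(n_estimators=300))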

The notebook is available here.

Enjoy!
