Unlabeled Data - Semi-Supervised Classification (PU Learning) - by Alon Agmon - Towards Data Science
The purpose of this post is to present one possible approach to PU problems which I have
recently used in a classification project. It is based on the paper “Learning classifiers
from only positive and unlabeled data” (2008) written by Charles Elkan and Keith Noto,
and on some code written by Alexandre Drouin. Although the scientific literature offers other approaches to PU learning (I intend to discuss another rather popular one in a future post), Elkan and Noto's (E&N) approach is quite simple and can be easily implemented in Python.
E&N essentially claim that, given a data set in which we have positive and unlabeled data, the probability that a certain sample is positive, P(y=1|x), equals the probability that the sample is labeled, P(s=1|x), divided by the probability that a positive sample is labeled in our data set, P(s=1|y=1).
If this claim is true (and I’m not going to prove or defend it — you can read the proof in
the paper itself and experiment with the code), then it seems relatively easy to
implement. This is so because although we don't have enough labeled data to train a
classifier to tell us whether a sample is positive or negative, in a PU scenario we do have
enough labeled data to tell us whether a positive sample is likely to be labeled or not and,
according to E&N, this is enough to estimate how likely it is to be positive.
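That said, the core identity follows almost immediately from the paper's "selected completely at random" assumption, which states that which positive samples get labeled does not depend on x, i.e. P(s=1|y=1,x) = P(s=1|y=1). Here is a sketch of the argument (my paraphrase of the paper's main lemma), using the fact that only positive samples can be labeled, so s=1 implies y=1:

P(s=1|x) = P(y=1, s=1|x)
         = P(y=1|x) * P(s=1|y=1, x)
         = P(y=1|x) * P(s=1|y=1)

Dividing both sides by P(s=1|y=1) gives exactly the claim above: P(y=1|x) = P(s=1|x) / P(s=1|y=1).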
Putting things more formally, given an unlabeled data set with just a group of samples labeled as positive, we can estimate the probability that an unlabeled sample x is positive by estimating P(s=1|x) / P(s=1|y=1). Luckily, we can use almost any sklearn-based classifier to estimate this, according to the following steps:
(1) Fit a classifier on a data set containing both the labeled and the unlabeled data, using an is-labeled indicator as the target y. Fitting a classifier in this way trains it to predict the probability that a given sample x is labeled, P(s=1|x).
(2) Use the classifier to predict the probability that the known positive samples in our data set are labeled, so that the predicted results represent the probability that a positive sample is labeled, P(s=1|y=1, x). Calculate the mean of these predicted probabilities; that mean is our estimate of P(s=1|y=1).
Having estimated P(s=1|y=1), all we need to do in order to predict the probability that a data point k is positive, according to E&N, is to estimate P(s=1|k), the probability that it is labeled, which is exactly what the classifier we trained in step (1) knows how to do.
(3) Use the classifier we trained in step (1) to estimate the probability that k is labeled, P(s=1|k).
# prepare data
classifier, ps1y1 = fit_PU_estimator(x_data, y_data, 0.2, Estimator())
predicted_s = classifier.predict_proba(x_data)[:, 1]  # P(s=1|x)
predicted_y = predicted_s / ps1y1                     # P(y=1|x)
The fit_PU_estimator() method performs two main tasks: it fits a classifier of your choice on a sample of the positive and unlabeled training set, and then estimates the probability that a positive sample is labeled. Correspondingly, it returns the fitted classifier (which has learned to estimate the probability that a given sample is labeled) and the estimated probability P(s=1|y=1). After that, all we need to do is find P(s=1|x), the probability that x is labeled. Because that is exactly what our classifier is trained to do, we just call its predict_proba() method. Finally, in order to actually classify sample x, we just divide the result by the P(s=1|y=1) that we have already found. This can be represented in code as:
import numpy as np

def fit_PU_estimator(X, y, hold_out_ratio, estimator):
    # The training set will be divided into a fitting set that will be used
    # to fit the estimator in order to estimate P(s=1|X), and a held-out set
    # of positive samples that will be used to estimate P(s=1|y=1)
    # --------
    # find the indices of the positive/labeled elements
    assert isinstance(y, np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0]
    # hold_out_size = the *number* of positive/labeled samples
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the *indices* of the positive elements
    # that we will later use to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size]
    # the actual positive *elements* that we will keep aside
    X_hold_out = X[hold_out]
    # remove the held-out elements from X and y
    X = np.delete(X, hold_out, 0)
    y = np.delete(y, hold_out)
    # we fit the estimator on the unlabeled samples + (the rest of) the
    # positive and labeled samples, in order to estimate P(s=1|X), i.e.
    # the probability that an element is *labeled*
    estimator.fit(X, y)
    # we then use the estimator to predict on the positive held-out set
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)
    # take the probability that the label is 1
    hold_out_predictions = hold_out_predictions[:, 1]
    # save the mean probability
    c = np.mean(hold_out_predictions)
    return estimator, c

def predict_PU_prob(X, estimator, prob_s1y1):
    # P(y=1|x) = P(s=1|x) / P(s=1|y=1)
    prob_pred = estimator.predict_proba(X)
    prob_pred = prob_pred[:, 1]
    return prob_pred / prob_s1y1
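To see the full flow end to end, here is a minimal, self-contained sketch. The synthetic data from make_classification, the LogisticRegression base estimator, the 150 labeled positives, and the 0.5 decision threshold are all illustrative choices, not the setup that produced the results below:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# build a toy data set; the true labels exist only to simulate a PU setting
X, y_true = make_classification(n_samples=1350, n_features=10,
                                weights=[0.8], random_state=42)

# in a PU setting only some positives carry a label (s=1);
# every other sample, positive or negative, is unlabeled (s=0)
rng = np.random.RandomState(42)
s = np.zeros(len(y_true))
positives = np.where(y_true == 1)[0]
s[rng.choice(positives, size=150, replace=False)] = 1.

# fit on the "is labeled" indicator, then convert P(s=1|x) into P(y=1|x)
classifier, ps1y1 = fit_PU_estimator(X, s, 0.2, LogisticRegression(max_iter=1000))
pu_probs = predict_PU_prob(X, classifier, ps1y1)
pu_labels = (np.clip(pu_probs, 0., 1.) >= 0.5).astype(int)

Note that dividing by P(s=1|y=1) can push some estimates above 1, hence the clipping before thresholding.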
Classification results when training on the fully labeled data set (benchmark):
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%
Classification results with the PU approach (only ~150 labeled positives):
f1: 90.24%
roc: 91.11%
recall: 82.62%
precision: 99.41%
A few important notes. First, the performance of this approach depends greatly on the size of the data set. In this example, I used about 150 positive samples and about 1,200 unlabeled ones. That is far from an ideal data set for this approach; if we only had 100 samples, for example, our classifier would perform very poorly. Second, as the attached notebook shows, there are a few variables to tune (such as the size of the sample to be set aside and the probability threshold to use for classification), but the most important one is probably the chosen classifier and its parameters. I chose to use XGBoost because it performs relatively well on small data sets with few features, but it will not perform best in every scenario, so it is important to test for the right classifier.
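For reference, plugging XGBoost in as the base estimator is a one-liner (assuming the xgboost package is installed; the hyperparameters below are illustrative, not the tuned values from the notebook):

from xgboost import XGBClassifier

# any estimator exposing fit() and predict_proba() will work here
estimator = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
classifier, ps1y1 = fit_PU_estimator(x_data, y_data, 0.2, estimator)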
Enjoy!