
On semi-supervised learning

Juan Francisco Agreda Vega


Universidad Nacional de Ingeniería

Abstract

This work is based on the article by Cholaquidis et al. (2020), which details the following. Semi-supervised learning (SSL) has a strong historical foundation and combines labeled and unlabeled data for joint classification. The convergence of SSL towards the Bayes risk has been extensively studied. The success of SSL relies on the cluster assumption, and satisfactory results can be achieved with a small number of labeled examples. A proposed new algorithm approaches the optimal theoretical performance under specific conditions; however, SSL is most effective in well-conditioned problems. The algorithm's performance is evaluated using real phoneme data. Two asymptotic results are established. The first shows that when the size l of the unlabelled sample is infinite, the classification error converges exponentially fast to the Bayes risk as the size n of the labelled sample converges to infinity. In the second it is assumed that the density of the covariates is given by a parametric model p(x) = π p(x|y = θ) + (1 − π) p(x|y = 1 − θ), where p(x) is known except for the parameters θ ∈ {0, 1} and π ∈ (0, 1); under regularity conditions, consistency is shown if the minimum of n and l converges to infinity. Roughly speaking, this condition requires p(x) to have a deep valley between the classes; in other words, clustering techniques have to perform reasonably well in the presence of only unlabelled data. Smoothness of the labels with respect to the features, or low density at the decision boundary, are examples of the kind of hypotheses required to get satisfactory results in the cluster analysis literature. The algorithm is of the “self-training” type; this means that at every step a point from the unlabelled set is labelled using the training sample built up to that step and incorporated into it, so the training sample increases from one step to the next. A simplified, computationally more efficient alternative algorithm is also provided.

Throughout this work, the letters I, A, B are used in two different styles, with the hope of facilitating the reading: one style denotes probability events, namely subsets of a (rich enough) probability space (Ω, Σ, P), while the other denotes subsets of the Euclidean space R^d. In most cases, the probability events are defined through conditions on random variables taking values in R^d. R^d is considered endowed with the Euclidean norm ‖·‖; the open ball of radius r ≥ 0 centered at x is denoted by B(x, r), and, with a slight abuse of notation, if A ⊂ R^d we write B(A, r) = ∪_{s∈A} B(s, r).
Theoretical best rule

It is well known that the optimal rule for classifying a single new datum X is given by the Bayes rule, g*(X) = I{η(X) ≥ 1/2}, where η(x) = P(Y = 1 | X = x). In the present paper we move from the classification problem of a single datum X to a framework where each coordinate of X_l = (X_1, ..., X_l) must be classified. The next result establishes that the optimal classification rule classifies each element by invoking the Bayes rule, ignoring the presence of the rest of the observations.
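For reference, the following minimal Python sketch (mine, not code from the article) applies the plug-in version of this rule once some estimate of the regression function η(x) = P(Y = 1 | X = x) is available; the name eta_hat and the toy logistic model in the usage example are illustrative assumptions.

```python
import numpy as np

def bayes_plugin_classify(eta_hat, X):
    """Plug-in Bayes rule: label 1 whenever the estimated posterior
    probability eta_hat(x) = P(Y = 1 | X = x) is at least 1/2."""
    eta = np.asarray([eta_hat(x) for x in X])
    return (eta >= 0.5).astype(int)

# Toy usage with a hypothetical estimate of eta.
if __name__ == "__main__":
    def eta_hat(x):
        return 1.0 / (1.0 + np.exp(-4.0 * x[0]))  # assumed toy model
    X = np.array([[-1.0, 0.3], [0.2, -0.5], [1.5, 0.0]])
    print(bayes_plugin_classify(eta_hat, X))  # [0 1 1]
```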
Algorithm

An algorithm is provided that is asymptotically optimal, in the sense of satisfying the optimality condition above. To accomplish this, we sequentially update the training sample by incorporating an observation X_{i_j} from X_l into the initial set D_n, along with a predicted label Ỹ_{i_j} ∈ {0, 1}. This approach allows us to select the “best classifiable point” from the remaining unclassified observations.

Initialization: let Z_0 = X_n, U_0 = X_l and T_0 = D_n.

Step j: for j = 1, ..., l, choose the best classifiable point in U_{j−1} among those lying at distance smaller than h_l from the points already classified, as follows: let U_{j−1}(h_l) = {X ∈ U_{j−1} : d(Z_{j−1}, X) < h_l}; then take X_{i_j} = arg max_{X_i ∈ U_{j−1}(h_l)} max{η̂_{j−1}(X_i), 1 − η̂_{j−1}(X_i)}, where η̂_{j−1} is an estimate of η built from the current training sample T_{j−1}.

Figure 1. Left panel: histogram of the classification error.
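The following Python sketch (my own, not the article's implementation) illustrates the self-training loop just described under simplifying assumptions: η̂_{j−1} is re-estimated at each step by a k-nearest-neighbour vote on the current training sample, and the radius h_l is held fixed; the function name and the choice k = 5 are illustrative.

```python
import numpy as np

def self_training_ssl(X_lab, y_lab, X_unl, h_l, k=5):
    """Hedged sketch of a self-training scheme: repeatedly label the most
    confidently classifiable unlabelled point lying within distance h_l of
    the points already classified, then add it to the training sample."""
    X_tr, y_tr = X_lab.copy(), y_lab.copy()
    remaining = list(range(len(X_unl)))
    labels = np.full(len(X_unl), -1)

    def eta_hat(x):
        # k-NN estimate of P(Y = 1 | X = x) on the current training sample.
        d = np.linalg.norm(X_tr - x, axis=1)
        return y_tr[np.argsort(d)[:k]].mean()

    while remaining:
        # Candidates: unlabelled points within distance h_l of the classified set.
        cand = [i for i in remaining
                if np.min(np.linalg.norm(X_tr - X_unl[i], axis=1)) < h_l]
        if not cand:
            break  # no point is close enough; stop here
        conf = [(max(eta_hat(X_unl[i]), 1 - eta_hat(X_unl[i])), i) for i in cand]
        _, best = max(conf)
        y_best = int(eta_hat(X_unl[best]) >= 0.5)
        labels[best] = y_best
        X_tr = np.vstack([X_tr, X_unl[best]])   # training sample grows
        y_tr = np.append(y_tr, y_best)
        remaining.remove(best)
    return labels, X_tr, y_tr
```

In this sketch, points that never come within distance h_l of the classified set are simply left unlabelled (−1).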
Consistency of the algorithm

To prove the consistency of the algorithm, additional conditions are required. They involve regularity properties of different sets and the rate at which h_l decreases.
A faster algorithm

The idea is to pre-process the sample X_l and project it on a grid G_l, as we describe in what follows. We can assume, without loss of generality, that X_l ∪ X_n ⊂ (a, b)^d with a < b. For N fixed, to be determined by the practitioner, consider a_i = a + i(b − a)/N for i = 0, ..., N. The N-grid G_l on (a, b)^d is determined by the N^d points of the form a = (a_{i_1}, ..., a_{i_d}) with i_j ∈ {0, ..., N − 1}, for j = 1, ..., d. Each point a in the grid determines a cell C_a = Π_{j=1}^{d} (a_{i_j}, a_{i_j + 1}].
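A minimal sketch (my own) of this pre-processing step, under the assumption that projecting the sample on the grid amounts to replacing each point by a representative of the cell C_a containing it; the function name and the toy parameters are illustrative.

```python
import numpy as np

def project_on_grid(X, a, b, N):
    """Map each point of X, assumed to lie in (a, b)^d, to the corner of the
    grid cell that contains it (one of the N^d cells of the N-grid)."""
    X = np.asarray(X, dtype=float)
    step = (b - a) / N
    # Cell index i_j along each coordinate, clipped to {0, ..., N-1}.
    idx = np.clip(np.floor((X - a) / step).astype(int), 0, N - 1)
    corners = a + idx * step                       # representative of each cell
    # One representative per occupied cell is what shrinks the sample.
    unique_cells, counts = np.unique(idx, axis=0, return_counts=True)
    return corners, unique_cells, counts

# Toy usage: 1,000 points in (0, 1)^2 projected on a 10-grid.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(1000, 2))
    _, cells, counts = project_on_grid(X, a=0.0, b=1.0, N=10)
    print(len(cells), "occupied cells out of", 10**2)
```

In this reading, working with one representative per occupied cell reduces the effective size of X_l, with N controlling how coarse the discretisation is.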
Examples with simulated and real data

In this section we report some numerical results, comparing the performance of the SSL algorithm with that of other supervised algorithms. Specifically, k-nearest neighbours (k-nn) and support vector machines (SVM) are the supervised techniques used to assign labels to each element of X_l on the basis of the training sample D_n. The classification error rate of each algorithm is computed in three scenarios. In the first two we use artificially generated data, whereas in the last one we employ a real data set. The first example compares the efficiency of the three algorithms (k-nn, SVM and the SSL algorithm). The second one shows the effect of the grid size on the classification error rate and the computational time. The third one is a well-known real data set where we illustrate the crucial effect of the initial training sample D_n.
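As an indication of how the supervised baselines enter the comparison, the sketch below (not the article's experimental code) fits k-nn and SVM on the labelled sample D_n and reports their misclassification rates on points whose true labels are known, as in the simulations; the scikit-learn settings (k = 5, RBF kernel) are illustrative placeholders rather than the settings used in the article.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def baseline_error_rates(X_train, y_train, X_test, y_test, k=5):
    """Fit the two supervised baselines on the labelled sample and
    report their misclassification rates on the evaluation points."""
    models = {"k-nn": KNeighborsClassifier(n_neighbors=k),
              "SVM": SVC(kernel="rbf")}
    errors = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        errors[name] = float(np.mean(model.predict(X_test) != y_test))
    return errors
```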
A first simulated example

The joint distribution of (X, Y) is generated as follows. Consider first the curve C in the square [−1, 1]^2 defined by C = {(x, (1/2) sin(4x)) : −1 ≤ x ≤ 1}. All the points in the square that are below C will be labeled with Y = 0, while those above the curve C will be labeled with Y = 1. Now, to emulate the valley condition, points close to C are chosen with lower probability than points far away. To do so, let S_1 and S_2 denote the sets of points in the square which are at ‖·‖∞-distance larger and smaller than 0.2 from C, respectively; namely, S_1 = (B_{‖·‖∞}(C, 0.2))^c ∩ [−1, 1]^2 and S_2 = B_{‖·‖∞}(C, 0.2) ∩ [−1, 1]^2, where ‖·‖∞ is the supremum norm. Let U_1, U_2 and B be independent random variables, with U_1 ∼ Uniform(S_1), U_2 ∼ Uniform(S_2) and B ∼ Bernoulli(7/8). Consider the random variable X = B U_1 + (1 − B) U_2, and set (X, Y) = ((X_1, X_2), 1) if X_2 > (1/2) sin(4 X_1) and (X, Y) = ((X_1, X_2), 0) if X_2 ≤ (1/2) sin(4 X_1).
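A hedged sketch of one way to simulate this distribution, drawing from S_1 and S_2 by rejection; the discretisation of the curve C, the seed and the sample size are my own choices.

```python
import numpy as np

rng = np.random.default_rng(42)
xs = np.linspace(-1.0, 1.0, 400)          # discretisation of the curve C

def curve(x):
    return 0.5 * np.sin(4.0 * x)

def sup_dist_to_curve(p):
    """Approximate sup-norm distance from the point p to the curve C."""
    return np.min(np.maximum(np.abs(p[0] - xs), np.abs(p[1] - curve(xs))))

def sample_valley_data(n):
    """With probability 7/8 draw uniformly on S_1 (farther than 0.2 from C in
    sup-norm), with probability 1/8 on S_2 (closer than 0.2), by rejection."""
    X = np.empty((n, 2))
    for i in range(n):
        far = rng.random() < 7 / 8        # B ~ Bernoulli(7/8) selects S_1
        while True:
            p = rng.uniform(-1.0, 1.0, size=2)
            if (sup_dist_to_curve(p) > 0.2) == far:
                X[i] = p
                break
    Y = (X[:, 1] > curve(X[:, 0])).astype(int)   # label 1 above C, 0 below
    return X, Y

X, Y = sample_valley_data(1000)
```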
A second example using simulated data

To generate the data, consider two bi-variate normal random vectors Z_0 ∼ N(µ_0, Σ) and Z_1 ∼ N(µ_1, Σ), and let Y ∼ Bernoulli(0.5). The conditional distribution of X given Y = y, for y = 0, 1, is given by X | Y = y ∼ Z_y | ‖Z_y − µ_y‖ < 1.5.

Figure 2. The two populations of bi-variate truncated Gaussian distributions.
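A minimal sketch of how the two truncated Gaussian populations could be simulated by rejection sampling; the means, covariance and sample size below are placeholders, since the article's exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_gaussian(mu, cov, radius, n):
    """Rejection sampling: draw Z ~ N(mu, cov) and keep it only
    when ||Z - mu|| < radius."""
    out = []
    while len(out) < n:
        z = rng.multivariate_normal(mu, cov, size=n)
        keep = np.linalg.norm(z - mu, axis=1) < radius
        out.extend(z[keep])
    return np.array(out[:n])

# Placeholder parameters (not those of the article).
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.5, 0.0])
cov = np.eye(2)
n = 500
Y = rng.integers(0, 2, size=n)                     # Y ~ Bernoulli(0.5)
X0 = sample_truncated_gaussian(mu0, cov, 1.5, n)
X1 = sample_truncated_gaussian(mu1, cov, 1.5, n)
X = np.where(Y[:, None] == 1, X1, X0)              # X | Y = y is the truncated Z_y
```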
A real data example

The data come from 150 people who spoke the name of each letter twice. Three observations have missing data and are not considered in the study.

Table 1: Average computation time and misclassification errors over 50 replications.
Concluding remarks

This work focuses on semi-supervised learning by proposing a simple algorithm and studying its long-term behavior. The effectiveness of semi-supervised learning relies on certain assumptions. Simulations demonstrate how the algorithm performs compared to other methods: it outperforms its competitors when the amount of unlabeled data exceeds 500. The trade-off between computation time and efficiency is investigated, and the real-data example shows a slight improvement with the semi-supervised learning algorithm.

References

Cholaquidis, A., Fraiman, R., and Sued, R. M. (2020). On semi-supervised learning. https://ri.conicet.gov.ar/handle/11336/147485

Asuncion, A., and Newman, D. J. (2007). UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, School of Information and Computer Sciences.

Belkin, M., and Niyogi, P. (2004). Semi-supervised learning on Riemannian manifolds. Machine Learning 56: 209–239.

CTI: https://dina.concytec.gob.pe/appDirectorioCTI/VerDatosInvestigador.do?idinvestigador=155242 | juan.agreda.v@uni.pe
