AdaBoost Based Transfer Learning Method For Positive An 2022 Knowledge Based

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Knowledge-Based Systems 241 (2022) 108162

Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

AdaBoost-based transfer learning method for positive and unlabelled


learning problem

Bo Liu a , , Changdong Liu a , Yanshan Xiao b , Laiwang Liu a , Weibin Li a , Xiaodong Chen a
a
School of Automation, Guangdong University of Technology, Guangzhou, 510006, China
b
School of Computers, Guangdong University of Technology, Guangzhou, 510006, China

article info a b s t r a c t

Article history: Positive and unlabelled learning (PU learning) is a problem that the training of a classifier only utilizes
Received 23 January 2021 labelled positive examples and unlabelled examples. Recently, PU learning has been widely studied
Received in revised form 28 June 2021 and used in a number of areas. In this paper, we present an AdaBoost-based transfer learning method
Accepted 3 January 2022
to solve PU Learning problem, which is briefly called AdaTLPU. In the proposed model, by sharing
Available online 20 January 2022
SVM parameters and regularization terms, the source task knowledge is transferred to the target
Keywords: task. At the same time, the similarity of the ambiguous examples towards the positive and negative
Transfer learning classes is taken into account to refine the decision boundary of the classifier. Meanwhile, we adopt the
PU learning AdaBoost method to ensemble the obtained weak classifiers to form a strong classifier for prediction.
AdaBoost method In addition, we put forward an iterative optimization method to obtain the classifier and present the
proof of training error bound for the proposed method. Finally, we organize experiments to explore
the performance of AdaTLPU and the results indicate that AdaTLPU can achieve the better performance
compared with previous PU learning methods.
© 2022 Elsevier B.V. All rights reserved.

1. Introduction first kind of algorithms is two-step strategy which first identifies


positive or negative examples from the unlabelled set and then
As a traditional machine learning problem, supervised learning utilizes the original positive examples and the reliable negative
methods usually train classifiers with labelled positive and nega- examples to train a classifier. For example, in the work of PU-
tive examples. In fact, it is not easy to obtain negative examples LP [12], it extracts reliable negative examples from unlabelled
in practical applications [1]. For example, when we use Tik Tok to class iteratively by enlarging the existing positive class and then
watch short videos, we can click the ‘‘like’’ button when we are builds the label propagation classifier. The second kind of algo-
interested in this video. When we encounter a video that we are rithms, biased PU learning, is a kind of SVM-based method, which
not interested in or annoying, we usually just swipe it directly. treats the unlabelled examples as negative examples with class
Even for the videos not marked as ‘‘like’’, they may include our label noise and assigns different penalties to the examples from
favourite videos and unlike videos. In this kind of situation, the positive and negative class. In the work of [8], the authors adopt a
APP will receive positive examples(like items) or unlabelled ex- label disambiguation strategy based on SVM to find a hyperplane
amples(like or unlike items), which forms the problem of positive at the maximum margin between the most likely labels and the
and unlabelled learning. Therefore, PU learning problem has been less likely ones. The last kind of algorithms suggests PU classi-
studied to deal with the case that we need to train the classifier fication can be cast as a cost-sensitive learning, which regards
with only positive examples and unlabelled examples [2]. unlabelled examples as negative by reweighting the penalty for
In past few years, many PU learning algorithms have been pro- each class via a manual or automatic way. For example, the work
posed. Considering how unlabelled examples are treated, the ear- of [10] finds that the risk estimator is unbiased if the surrogate
lier paper about positive and unlabelled learning can be mainly loss is non-convex and meets a symmetric condition.
grouped into three categories, including two-step strategy [2–6], Although PU learning problem has been studied by many
biased PU learning [5,7,8], and cost-sensitive learning [9–11]. The researchers, there is little work to study the problem of PU learn-
ing with transfer learning. Transfer learning is able to increase
∗ Corresponding author. the classification accuracy of target task by the knowledge of
source task. For example in [13], the authors propose an ensem-
E-mail addresses: csboliu@163.com (B. Liu),
liuchangdong1997@foxmail.com (C. Liu), xiaoyanshan@189.cn (Y. Xiao),
ble transfer learning framework, in which the weights of base
liulw0909@hotmail.com (L. Liu), csweibinli@hotmail.com (W. Li), classifiers are flexibly assigned by the test examples in the target
csxdchen@hotmail.com (X. Chen). domain. In the work [14], Chougrad et al. is concerned about the

https://doi.org/10.1016/j.knosys.2022.108162
0950-7051/© 2022 Elsevier B.V. All rights reserved.
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

early diagnosis of breast cancer and introduce transfer learning examples and unlabelled examples to solve this problem. For PU
to multi-label learning based on a fine-tuning strategy. Zheng learning, a lot of researches have been done and we can roughly
et al. [15] propose a dictionary-based transfer learning method, divide the previous methods into following three categories.
which utilizes the input examples to obtain sparse representation The first category is two-step strategy [2,3,5,19], which first
and then trains the classifier with the obtained sparse represen- uses the unlabelled set to extract positive or negative examples
tation. In addition, boosting has been proposed to construct a and then trains classifier with the reliable negative examples
strong classifier by ensembling several weak classifiers [16]. This and the original positive examples. For example, Liu et al. [2]
can be beneficial for constructing a strong classifier when only propose a method called S-EM, which builds an initial classifier
a small number of training data is available. Moreover, boosting to select those documents that are most likely to be negative
can reduce the influence of the individual classifier and integrate documents from the mixed dataset. After that, they utilize EM
the weak classifiers into a robust classifier. Considering transfer algorithm to obtain a good local maximum of the likelihood func-
learning has been widely studied, it is essential to consider the tion. Li et al. [3] use Rocchio method to extract reliable negative
problem of transfer learning-based PU learning, in which boosting examples and SVM technique for classifier building.
is considered to ensemble a strong classifier for prediction. The second category is biased PU learning method [5,7,20]
In the paper, we solve the problem of transfer learning with that treats the unlabelled examples as negatives examples with
positive and unlabelled data. To build the learning model, we class label noise. These methods will place higher penalties on
mainly face two challenges. Firstly, when positive and unlabelled misclassified positive examples in order to overcome the class
data from the source and target domains can be obtained, how to label noise. Ke et al. [20] put forward a PU learning classifier
use them to guide the model to learn a classifier for prediction. by combining biased least square SVM (BLSSVM) and a smooth
Secondly, how to construct the strong transfer learning-based regularization term. Sellamanickam et al. [21] propose a new
classifier by integrating the weak transfer learning classifiers with SVM-based approach, which includes a margin-based dual loss
positive and unlabelled data. In order to resolve the above two function. In the approach, the positive and negative class ex-
challenges, we propose a novel approach, called AdaBoost-based amples obtain the penalty, and the method uses an adjustment
transfer learning method for PU learning problem(AdaTLPU). We formula to set the threshold and regularization parameter in the
build the transfer learning model based on shared parameter in objective function. However, in this kind methods, it is challeng-
the SVM and the regularization terms are used to transfer the ing to place proper penalties for the label noise. If the penalties
knowledge from the source task. In the proposed method, we first are not selected properly, this will reduce the performance of the
identify reliable negative examples from unlabelled data for the methods.
target task and source task respectively, and then calculate the The last category suggests that PU classification can be cast
similarity for the ambiguous examples for the rest data of the as a cost-sensitive learning, which regards unlabelled examples
unlabelled data. We then build the transfer learning-based PU as negative by reweighting penalty to each class [9–11]. For
model which incorporates the positive and reliable examples, am- example, Du Plessis et al. [10] first show that the PU learning
biguous examples with similarity weights into the learning. With problem can be solved by cost-sensitive learning and find that
the derived classification errors, we can adopt AdaBoost method using non-convex loss functions can avoid leading to a wrong
to ensemble the weak classifiers to build a strong classifier. classification boundary. The work in [22] introduces a more gen-
In all, the novelty of this paper is that it is the first time eral risk estimator to overcome the non-convexity, which uses
that transfer learning and AdaBoost are incorporated into positive ordinary convex loss function and composite loss function for
and unlabelled learning in the best of our knowledge. At the the unlabelled examples and positive examples. Although these
same time, we incorporate the similarity values for the ambigu- methods work well in practice, their performance depends largely
ous examples contained in the unlabelled set into the transfer on the reliability of the loss function for the PU data. In fact,
learning-based classifier so that the ambiguous examples can these methods are unreliable in the general cases, which limits
contribute to the classifier construction according to their sim- their applications in practice. In contrast, our proposed method
ilarity values. As a result, we can obtain a superior classifier for is model-based and the objective function is theoretically based
prediction. Further, there also exists a challenge to solve the pro- on the principle of structural risk minimization and empirical risk
posed model. We then utilize the quadratic programming method minimization.
to solve the AdaTLPU model efficiently and analyse the conver- For positive and unlabelled learning problem, the previous
gence of the proposed method based on the derived training work always utilizes the knowledge of single domain to build
errors. the classifier for prediction. However, in many positive and un-
The rest of this paper is organized as follow. Section 2 intro- labelled learning conditions, the model falls to achieve good
duces the related work. Section 3 proposes the AdaTLPU method performance due to the little number of examples. Thus, in order
and obtains an optimization algorithm to solve the presented to improve the performance of the model, we can use transfer
model. Section 4 shows the experiments and results. Section 5 learning to transfer domain knowledge with sufficient examples
concludes the paper. to the target domain with insufficient examples so that the clas-
sifier built on the target domain can have superior prediction
2. Related work results.

2.1. Positive and unlabelled learning 2.2. Transfer learning

With the rapid development of information technology, hu- In the recent years, transfer learning [23–25] in data mining
man produce massive data all the time. In some areas, like in- and machine learning become an important topic. Different from
formation retrieval [9,17] and text classification [1,18], people the traditional machine learning for one task, transfer learning
often have some positive data and unlabelled data. Meanwhile, makes use of source domain data and target domain data to train
it is expensive and laborious to label data manually. Due to the model. For the previous transfer learning methods, they can be
reason, positive and unlabelled learning has draw more and more roughly classified into following four categories.
attention in recent years. Positive and unlabelled learning is a The first approach is instance-transfer [26–28], which is based
kind of binary classification problem, which utilizes only positive on the precondition that the source task data and target task
2
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

data have certain part data similar. In practice, it often re-weights method in Section 3.3. In Section 3.4, we solve the presented
some source domain data and then utilizes them in target do- SVM-based problem and obtain the dual problem. With the dual
main. Jiang et al. [26] propose a new method based on conditional problem, we ensemble the base classifiers by AdaBoost method
probability, which is to eliminate the incorrect training examples to form a strong classifier.
from the source task data. Liao et al. [27] propose an active
learning method, which utilizes the source task data to choose 3.1. Reliable negative examples extraction
unlabelled data in the target task. Wu et al. [28] assume the
source domain data come from different distributions and present We assume that Ss and St denote the source domain and
a new SVM-based method to improve accuracy of the classifier. target domain respectively, and each domain contains positive
This kind of approaches is based on instance to transfer knowl- and unlabelled dataset. For the source and target domain, we
edge, which needs to measure the similarity and assign weight to have Ss = {PSs , USs } and St = {PSt , USt }. Here, PSs and PSt denote
the instances used to transfer knowledge. And this makes it less the labelled positive examples for two domains, and USs and USt
efficient than parameter-based transfer learning. represent the unlabelled examples for two domains.
The second approach is feature-representation-transfer [29, We extract the reliable negative examples in the source do-
30], which needs to obtain appropriate feature representation main and the target domain and put them in subset NSs and
that can let the divergences between the source and the target NSt respectively. We take the source domain as an example to
tasks as small as possible. According to whether the source task explain this process, and the target domain operates the same
data is labelled or not, they can be divided into supervised and way. Firstly, we utilize both Rocchio technique [3] and Spy tech-
unsupervised feature construction. Jebara [29] proposes a SVM- nique [2] to obtain the most reliable negative examples, and put
based multi-task learning method, which is used in multiple them into subsets Ss1 and Ss2 respectively. We believe that the ex-
classification tasks under different labelled datasets conditions. amples agreed by both techniques are most the reliable negative
In addition, Lee et al. [30] propose a multi-task learning method, examples. That is, NSs = Ss1 ∩ Ss2 . For Rocchio technique [3], It first
which can learn feature weights and meta-priors from each task assigns the unlabelled set U to the negative class, and positive
simultaneously and each task utilizes the meta-priors to transfer set P to the positive class. Then, it uses x to denote one certain
the knowledge. This kind of methods is more dependent on example, and we can calculate the positive prototype vector c +
feature representation and feature selection. Once there is no and the negative prototype vector c − :
suitable feature to transfer knowledge, it will not obtain desirable
1 ∑ x 1 ∑ x
performance. c+ = α −β , (1)
The third approach is parameter-transfer [13,31], which is to |P | ∥x∥ |U | ∥x∥
x∈P x∈U
share prior distributions or some hyperparameters in the model. 1 ∑ x 1 ∑ x
In the work [31], Evgeniou et al. propose transfer learning method c− = α −β . (2)
|U | ∥x∥ |P | ∥x∥
for SVM, which transfers the knowledge of source task to target x∈U x∈P
task via common parameters. Gao et al. [13] propose an ensemble
Finally, for each example x in unlabelled set U, if Sim(c + , x) ≤
transfer learning framework, in which the weights are depend on
Sim(c − , x) then, put the example x into negative set N, and the
the predictive ability model.
examples in set N are the reliable negative examples. For Spy
The last approach is relational-knowledge-transfer [32,33],
technique [2], it first puts some labelled examples(Spies) into
which does not assume that the data are drawn from each
unlabelled set to obtain spy data. And then, the technique utilizes
domain by independently and identically distributed. This kind of
the spy data to determine a threshold t. With the threshold t,
methods always builds the map of relational knowledge between
we can use it to estimate the most likely negative examples. In
source task and target task. For example, Mihalkova et al. [32]
practice, for a specific example x, if the probability satisfies the
propose an approach based on Markov Logic Networks(MLNs),
following equation:
which can transfer relational knowledge across the relational do-
mains. Richardson et al. [33] present a method to integrate First- Pr[c − |x] < t , (3)
Order Logic(FOL) and Probabilistic Graphical Models(PGM) in a −
united representation for statistical relational learning. Although we can assume x belongs to the negative class c . Then, put the
this kind of approaches does not require the data distribution to example x into negative set N, and the examples in set N are the
be independently and identically distributed, it needs to build a reliable negative examples. For more details of Rocchio technique
relational map between source task and target task which leads and Spy technique, one can refer to [2,3].
to the limitation of this approach. After the reliable negative examples are obtained, we remove
Although transfer learning has always received much atten- them from the unlabelled data subset, i.e., USs = USs − NSs .
tion. Most of the transfer learning methods aim at learning with Similarly, we can obtain set NSt and USt .
certain labels. They do not explicitly solve the problem of transfer
learning with positive and unlabelled learning and do not transfer 3.2. Similarity weight generation
knowledge from the source task to the target task where both
tasks contain unlabelled examples. In this paper, we propose a After the above step, the datasets are separated into three
transfer learning-based framework to address the positive and pairs of subsets: positive sets PSs , PSt , reliable negative sets NSs ,
unlabelled learning problem, and build the predictive classifier NSt and ambiguous sets USs , USt . So as to make full use of the
for the target domain by transferring knowledge from the source data in the USs , USt , we put forward the similarity-based data
domain. model and calculate the corresponding similarity values in the
method. For the ambiguous example x in the sets USs and USt ,
3. The proposed method the similarity-based data model is presented as follows:

{x, (m+ (x), m− (x))}, (4)


In this section, we present an AdaBoost-based transfer learn-
+ −
ing method for PU learning problem. We introduce the reliable in which m (x) and m (x) are similarity weights to represent
negative examples extraction in Section 3.1 and the similarity the degree of x towards the positive and negative classes, respec-
weight generation in Section 3.2. We then propose the AdaTLPU tively.
3
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162
( )
To calculate similarity of each example in USs and USt , we ∑ ∑
set up the positive prototypes and the negative prototypes for + Csl ms (xi )ξil +
+
ms (xj )ξjl

source and target domains respectively. Take source domain as PSs ∪USs NSs ∪USs
( )]
an example, we firstly use K-mean clustering to cluster the exam- ∑ ∑
ples in NSs into m micro-clusters, denoted as NSs1 , NSs2 , . . . , NSsm + Ctl mt (xk )ξkl +
+
mt (xh )ξhl

,
where ms = t |NSs |/(|USs | + |NSs |) and t is set to be 30 in the PSt ∪USt NSt ∪USt

experiments [4,34]. Then, the kth positive prototype, denoted as


psk , and the kth negative prototype, denoted as nsk , are built as s.t. (ω0l + vsl )T · φ (xi ) + b1l ≥ 1 − ξil , xi ∈ PSs ∪ USs
follows:
(ω0l + vsl ) · φ (xj ) + b1l ≤ −1 + ξjl ,
T
xj ∈ NSs ∪ USs
α ∑ x β ∑ x
psk = − , k = 1, 2, . . . , m, (5) (ω0l + vtl )T · φ (xk ) + b2l ≥ 1 − ξkl , xk ∈ PSt ∪ USt
|PSs | ∥x∥ |NSsk | ∥x∥
x∈PSs x∈NSsk
(ω0l + vtl )T · φ (xh ) + b2l ≤ −1 + ξhl , xh ∈ NSt ∪ USt
α ∑ x β ∑ x
nsk = − , k = 1, 2, . . . , m. (6) ξil ≥ 0, ξjl ≥ 0, ξkl ≥ 0, ξhl ≥ 0
|NSsk | ∥x∥ |PSs | ∥x∥
x∈PSs x∈PSs
1T µ = 1, µ = [µ1 , µ2 , . . . , µL ]T . (13)
where ∥x∥ denotes the norm of example x, α and β are suggested
to set 16 and 4 [4,34]. For the target domain, NSt , ptk and ntk can For the objective function above, we briefly introduce it as follow:
be obtained in the same way. • ω0l is common parameter for the lth classifier of two do-
For the ambiguous example xi in subset USs , we calculate mains, νsl and νtl are the private parameters for the lth
its similarity to each of the positive and negative prototypes as
classifier of two domains respectively. λsl and λtl are transfer
follows.
learning parameters, Csl and Ctl are parameters to balance
xi · psk
Sim(xi , psk ) =   , k = 1, 2, . . . , m, (7) the margin and errors. ξil , ξjl , ξkl and ξhl are slack variables.
∥xi ∥ · psk  • m+s (xi )ξil and ms (xj )ξjl can be considered as the errors for

xi · nsk PSs , NSs and USs with different weights in the source do-
Sim(xi , nsk ) = , k = 1, 2, . . . , m. (8)
∥xi ∥ · ∥nsk ∥ main. We notice that, a smaller value of m+ (xi ) can weaken
For the ambiguous example xi , the corresponding similarity the role of ξil , thus the example xi turns out to be less
weights towards the positive and negative classes are computed important towards positive class. Similarly, m+ t (xk )ξkl and
as follows: m−t (xh )ξhl are the errors for the target domain.
∑m
Sim(xi , psk )
m+ (xi ) = ∑m k=1
, (9)
k=1 (Sim(xi , psk ) + Sim(xi , nsk )) 3.4. Dual problem
∑m
k=1 Sim(xi , nsk )
m− (xi ) = ∑m . (10) To solve the Problem (13), we first solve the lth classifier and
k=1 (Sim(xi , psk ) + Sim(xi , nsk ))
then ensemble L obtained classifiers into a strong classifier. To
For the above m+ (xi ) and m− (xi ), we find that, if the example simplify the presentation, we let Ss+ = PSs ∪ USs , Ss− = NSs ∪ USs ,
xi resides close to the positive prototype, its similarity towards Ss = Ss+ ∪ Ss− , St + = PSt ∪ USt , St − = NSt ∪ USt , St = St + ∪ St − .
the positive class is larger; otherwise, the similarity towards By introducing Lagrangian multipliers αils+ , αjls− , αkl t+
and αhl
t−
for
the negative class is larger. The same way, we can obtain the
the set Ss+ , Ss− , St + and St − , the lth basic classifier is to solve the
similarity weights of examples in USt .
follow dual problem (The proof of (14) is presented in Appendix).

3.3. Objective function 1 λsl λtl


min ∥ω0l ∥2 + ∥νsl ∥2 + ∥νtl ∥2
α 2 2 2
In the paper, we solve the problem of transfer learning-based
∑ ∑
− αils+ (ω0l + vsl )T φ (xi ) + αjls− (ω0l + vsl )T φ (xj )
positive and unlabelled learning by extending support vector Ss+ Ss−
machines (SVMs), since SVM is built based on the principle of ∑ ∑
structural risk minimization and empirical risk minimization [35]. − αt+
kl (ω0l + vtl ) φ (xk ) +
T
αhlt − (ω0l + vtl )T φ (xh )
Both structural risk minimization and empirical risk minimization St + St −
can guarantee the built classifier has superior performance. For
∑ ∑ ∑ ∑
+ αils+ + αjls− + αklt + + αhlt − ,
the source domain and the target domain, suppose the ωs and ωt
Ss+ Ss− St + St −
in SVM can be presented as follows.

ωs = ω0 + vs , 0 ≤ αils+ ≤ Cs m+
s (xi ), 0 ≤ αjl ≤ Cs ms (xj )
(11) s.t. s− −

ω t = ω 0 + vt . 0 ≤ αkl t (xk ), 0 ≤ αhl ≤ Ct mt (xh )


(12) t+
≤ Ct m+ t− −

where ω0 is a common parameter and vs and vt are private


∑ ∑ ∑ ∑
αils+ − αjls− = 0, αklt + − αhlt − = 0, (14)
parameters for the two domains. We let fs = ωs · φ (x) + b1 and Ss+ Ss− St + St −
ft = ωt ·φ (x) + b2 are two hyperplanes for both tasks respectively.
For the AdaBoost-based transfer learning method for PU learn- where
ing, assume there are L weak classifiers, and weight of each 1 ∑ s+ ∑
classifier is denoted as µl , by taking the examples in the positive vsl = ( αil φ (xi ) + αjls− φ (xj )), (15)
λsl
set, negative set, and the ambiguous data set, the proposed model Ss+ Ss−

is formulated as follows: 1 ∑ t+ ∑
vtl = ( αkl φ (xk ) + αhlt − φ (xh )), (16)
L λtl
λsl λtl
[
∑ 1 St + St −
min µl ∥ω0l ∥ +2 2
∥νsl ∥ + ∥νtl ∥ 2
ω0 ,ν,b,ξ ,µ
l=1
2 2 2 ω0l = λsl vsl + λtl vtl . (17)
4
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

3.5. Iterative optimization algorithm 3.6. Training error analysis

In this section, we utilize AdaBoost to ensemble the obtained In this section, we theoretically analyse the training error of
classifiers to form a strong classifier based on the classification the AdaBoost in our proposed model. With the training error
error of the proposed method. analysis, we can prove that the proposed method is convergent.
In AdaBoost method, it first initializes the weight distribu- Based on the work [36], we then present the analysis of the
tion of the training data with the same weight, and then trains training error bound as follow.
the weak classifier. The specific training process is: if a certain
training example is accurately classified by the weak classifier, Algorithm 1 AdaBoost-based transfer learning method for PU
then in the next training process, its corresponding weight should learning.
be decreased; on the contrary, if a certain training example is Input: Source dataset Ss = {PSs , USs }, target dataset St =
misclassified, then its weight should be increased. The training set {PSt , USt }. The maximum number of classifiers: L.
with updated weights is used to train the next classifier, and the Output: The target classifier Ft .
entire training process proceeds iteratively in this way. Finally, 1: Initialize the weights of the target domain examples

combine the weak classifiers obtained from each training into a D1 = (d11 , · · · , d1i , · · · , d1|St | ), where d1i = |S1 | , i =
t
strong classifier. After the training process of each weak classifier 1, 2, · · · , |St |
is over, increase the weight of the weak classifier with a small 2: for l = 1 to L do
classification error rate to make it play a greater decisive role 3: Solute ω0l , vsl , vtl , b1l and b2l by (14);
in the final classification function, and reduce the weak classifier 4: Compute the lth classifier ftl (x) = (ω0l + vtl ) · φ (x) + b2l for
with a large classification error rate. The weight makes it play a the target task;
5: Use (18) to calculate the error rate of ftl (x):
smaller decisive role in the final classification function. In other
γ (xi , yi )
[∑ ]
words, a weak classifier with a low error rate occupies a larger #{PSt }l + #{NSt }l + USt l
el = ; (20)
weight in the final classifier, otherwise it is smaller. |St |
As we all know, PU learning problem is classified as weak-
Calculate the weights of the base classifier: µl = 1 1−e
supervised learning problem. It is not easy to evaluate the clas- 6:
2
lg e l ;
l
sification error in an iteration, however, AdaBoost method need 7: Based on (19), update the weights of examples:
to use the classification error in every iteration to adjust the Dl+1 = (dl+1,1 , · · · , dl+1,i , · · · , dl+1,|St | ),
parameters. Thus, we need to construct a formula to compute the where
classification error for PU learning problem. dl,i
Based on the similarity weight, we put forward the classifica- dl+1,i = exp(−µl yi ftl (xi )), i = 1, 2, · · · , |St |, (21)
Zl
tion error for PU learning as follow:
Zl is normalization factor
γ (xi , f (xi ))

#{PS } + #{NS } +
,
US |St |
e= (18) ∑
|S | Zl = exp(−µl yi ftl (xi )), (22)
i=1
where #{PS } and #{NS } is the number of misclassification points
For xi ∈ USt ,
in PS and NS class respectively, |S | is the total number of data set.
γ (x, f (x)) is a function define as follow: yi = sign(γ (xi , ftl (xi )) −
1
); (23)
2
1 − m (xi ),
{ −
f (xi ) = −1 end for
γ (xi , f (xi )) = 8:
1 − m+ (xi ), f (xi ) = +1 9: Build the linear combination of basic classifiers,
1 1 L
= (− f (xi ) + ) · (1 − m− (xi ))

ft (x) = µl ftl (x) (24)
2 2 l=1
1 1
+ ( f (xi ) + ) · (1 − m+ (xi )). (19) 10: return Final Classifier:
2 2 ( L )
The function above is the evaluation of classification error rate

Ft (x) = sign(ft (x)) = sign µl ftl (x) (25)
for ambiguous examples. If an ambiguous example xi is classified l=1

to positive class, the classification may assign the corresponding


label f (xi ) equal to +1. Then we can consider the classification
error rate is (1 − m+ (xi )), which is the similarity of xi towards Theorem 1. In AdaTLPU, the training error bound of the final strong
negative class. Note that, the more example xi similar to positive classifier is
class, the smaller the classification error rate (1 − m+ (xi )). Thus, |St | |St | L
we can have the classification error for the PU learning. 1 ∑ 1 ∑ ∏
I(Ft (xi ) ̸ = yi ) ≤ exp(−yi ft (xi )) = Zl (26)
Based on this, we present the iterative optimization algorithm |St | |St |
i=1 i=1 l=1
in Algorithm 1. All the algorithm steps are based on the AdaBoost
framework, in the Algorithm, step 1 initializes the weights of the in which Ft (x), ft (x) and Zl are obtain by (25), (24) and (22).
examples. And then, step 2 to step 8 is an iterative process to train
the weak classifier, which calculates the classification error of L Proof. if Ft (xi ) ̸ = yi , then yi ft (xi ) < 0, so exp(−yi ft (xi )) ≥ 1. Thus,
basic classifiers and obtains their weights. In step 5, we utilize
the proposed (18) formula to calculate the error rate of the PU |St |
1 ∑
|St |
1 ∑
learning classifier. In step 7, we update the weights of examples I(Ft (xi ) ̸ = yi ) ≤ exp(−yi ft (xi )). (27)
by formula (21). For ambiguous examples xi ∈ USt , we use (23) |St | |St |
i=1 i=1
to calculate its label yi . After this, in step 9, we combine the weak The formulation (21) can be written as
classifiers and obtain the classifier for positive and unlabelled
learning. dl,i exp(−µl yi ftl (xi )) = Zl dl+1,i . (28)
5
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

Then we have 4.1. Baselines


|St | |St | L
1 ∑ 1 ∑ ( ∑ )
exp(−yi ft (xi )) = exp − µl yi ftl (xi ) To investigate our AdaTLPU method, GLLC [20], PUFC [37],
|St | |St | TrAdaB [38], LMTTL [39], MADA [40], MDD [41] is chosen as the
i=1 i=1 l=1
|St | L baselines.
∑ ∏
= d1i exp(−µl yi ftl (xi )) 1. GLLC: this method constructs a classifier using local and
i=1 l=1 global ideas. In local aspect, the objective function intro-
|St | L duces regularization term to indicate the geometric prop-
∑ ∏
= Z1 d2i exp(−µl yi ftl (xi )) erty of each example. In global aspect, the method utilizes
i=1 l=2 a Biased Least Square Support Vector Machine(BLSSVM) for
|St |
∑ L
∏ PU learning examples.
= Z1 Z2 d3i exp(−µl yi ftl (xi )) 2. PUFC: this method selects reliable negative examples and
i=1 l=3 expands positive set based on fuzziness. By classification
= ··· fuzziness, PUFC uses data editing technique to filter out
|St |
noise points.
∑ 3. TrAdaB: this method combines the AdaBoost method with
= Z1 Z2 · · · ZL−1 dL,i exp(−µL yi ftL (xi ))
transfer learning method, which can customize a basic
i=1
classifier to obtain transfer learning classifier based on the
L
∏ AdaBoost.
= Zl . □ (29) 4. LMTTL: this SVM-based method proposes a kind of regu-
l=1 larization tern based on pairwise projective distribution in
Hilbert space.
Theorem 2 ([36]). In AdaTLPU, each weak classifier ftl has error el in 5. MADA: this neural network-based method presents a
lth iteration. Suppose the training error of the final strong classifier
multi-adversarial domain adaptation approach, which cap-
ft is e, then we have
tures multimode structures to enable fine-grained align-
L
∏ √ ment of different data distributions based on multiple
e≤ 2 el (1 − el ). domain discriminators.
l=1 6. MDD: this neural network-based method introduces a mar-
We then obtain gin disparity discrepancy measurement to solve the prob-
lem of unsupervised domain adaption, which is tailored to
Lemma 1. Let γl = 1
2
− el , γ > 0. When γl > γ (l = 1, 2, . . . , L), distribution comparison with asymmetric margin loss, and
we have minimax optimization.
e ≤ exp(−2Lγ 2 ). (30) In our experiment, we use F score as the measurement. F
score is weighted by precision = TP /(TP + FP) and recall =
TP /(TP + FN), the formula is:
Proof. Firstly, we have 2 × precision × recall
L L
F score =
∏ √ ∏√ (precision + recall)
2 el (1 − el ) = 1 − 4γl2 . (31)
where FP, FN and TP is false positive, false negative and true
L=1 l=1
positive respectively, the higher F value, the better performance.
√ m = −4γl , 0 < m < 1. According to the Taylor series of
2
Let
1 − x at x = 0, we have 4.2. Datasets
√ m m2 3m3 15m4

1 − 4γl2 = 1 − m = 1− − − − −· · ·− Rn (x), In order to study the properties of the proposed method, we
2 8 48 384 organize experiments on four real-world datasets:
(32)
1. 20 Newsgroups1 : 20 Newsgroups is an international stan-
in which Rn (x) = o(xn )(x → 0). According to the Taylor series of
dard datasets in text classification area. The dataset con-
exp(x) at x = 0, we have
tains text documents from 20 different newsgroups and
m m m2 m3 m4 each newsgroup is related to a topic. Some newsgroups
exp(−2γl2 ) = exp(− ) = 1− + − + + · · · + Rn (x) have very similar themes, while others are irrelevant.
2 2 8 48 384
2. Reuters-215782 : This dataset is a very important dataset in
(33)
document classification task. It contains 21 578 documents
Since it has (33)–(32) > 0, so that we can obtain: which can be grouped into 135 clusters based on different
L √ L topics.
3. MNIST3 : This is handwritten digits dataset, which contains
∏ ∑
1 − 4γl2 ≤ exp(−2 γl2 ) ≤ exp(−2Lγ 2 ). □ (34)
l=1 l=1
10,000 test instances and 60,000 training instances. This
dataset was hand-written by 250 different people, half of
4. Experiment result whom are from the Census Bureau and the other half from
high school students.
In this section, we organize experiments for the proposed
method to study its performance. The experiments were con-
1 http://people.csail.mit.edu/jrennie/20Newsgroups/.
ducted on a desktop with Windows 7 system, Intel Core i5-7500
3.40-GHz processor, 8 GB RAM, and the software environment is 2 http://www.daviddlewis.com/resources/testcollections/.
on Python 3.7. 3 http://yann.lecun.com/exdb/mnist.

6
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

Table 1
Experiment datasets.
Sub-datasets Configuration Source domain Target domain
Positive Negative Positive Negative
sub-dataset1 comp vs. R_comp comp sci, talk, rec comp sci, talk, rec
sub-dataset2 rec vs. R_rec rec sci, talk, comp rec sci, talk, comp
sub-dataset3 sci vs. R_sci sci rec, talk, comp sci rec, talk, comp
sub-dataset4 Orgs vs. R_Orgs Orgs People, Places Orgs People, Places
sub-dataset5 People vs. R_People People Orgs, Places People Orgs, Places
sub-dataset6 Places vs. R_Places Places Orgs, People Places Orgs, People
sub-dataset7 MNIST vs. USPS C1 in MNIST R1 in MNIST C1 in USPS R1 in USPS
sub-dataset8 USPS vs. MNIST C1 in USPS R1 in USPS C1 in MNIST R1 in MNIST
sub-dataset9 MNIST vs. USPS C2 in MNIST R2 in MNIST C2 in USPS R2 in USPS
sub-dataset10 USPS vs. MNIST C2 in USPS R2 in USPS C2 in MNIST R2 in MNIST
sub-dataset11 MNIST vs. USPS C3 in MNIST R3 in MNIST C3 in USPS R3 in USPS
sub-dataset12 USPS vs. MNIST C3 in USPS R3 in USPS C3 in MNIST R3 in MNIST
sub-dataset13 Amazon vs. DSLR A_bike A_phone D_bike D_phone
sub-dataset14 DSLR vs. Amazon D_bike D_phone A_bike A_phone
sub-dataset15 Webcam vs. Amazon W_bike W_phone A_bike A_phone
sub-dataset16 ImageNet vs. Pascal I_bottle I_chair P_bottle P_chair
sub-dataset17 Pascal vs. ImageNet P_bottle P_chair I_bottle I_chair
sub-dataset18 Caltech-256 vs. Pascal C_bottle C_chair P_bottle P_chair

4. USPS4 : This is a handwritten digital dataset, which con- as negative class. According to the existing three domains, we
tains 2007 test instances and 7291 training instances. Each have chosen three transfer way, namely A→D, D→A and W→A.
image is 16 × 16 grey pixels, similar to the MNIST dataset. For ImageCLEF-DA dataset, we operate it same as the Office-31
5. Office-31 [42]: This is an object recognition dataset, which dataset. We choose ‘‘bottle’’ class as the positive class and ‘‘chair’’
contains 31 object categories in three domains: Amazon(A), class as negative class. And we choose I→P, P→I and C→P as the
DSLR(D) and Webcam(W). The Amazon domain contains transfer way. The obtained sub-datasets are shown in Table 1.
on average 90 images per class and 2817 images in total. For each of the above datasets, we conduct the following
The DSLR domain has 498 images which contains 5 objects operations to obtain sub-datasets for the problem of positive and
per category. For Webcam, there are 795 images of low unlabelled learning. Firstly, we choose one certain category as the
resolution exhibit significant noise and colour as well as positive class and the remaining categories as the negative class.
white balance artefacts. We randomly select 10% examples in the positive class as the
6. ImageCLEF-DA [43]: This is a benchmark dataset for Im- labelled positive examples which is referred as PSs and PSt for
ageCLEF 2014 domain adaptation challenge, which contains source and target domain respectively. The remaining examples
three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) in the positive class and the examples from other categories are
and Pascal VOC 2012 (P). For each domain, there are 12 used to form the unlabelled set which denote as USs and USt . We
categories and 50 images in each category. use r to denote the ratio of the training examples to the target
domain. In this experiment, the r value is initially set to 0.01.
Next, we introduce the configurations for the datasets. For 20
newsgroups dataset, following the work in [38], the dataset can 4.3. Setting of experiment
be separate into four newsgroups: ‘‘rec’’ newsgroup, ‘‘talk’’ news-
group, ‘‘sci’’ newsgroup and ‘‘comp’’ newsgroup. Each newsgroup The parameters in AdaTLPU and baselines is set as follow.
takes turns as the positive examples and the remaining three In GLLC method, parameter Cn , λ and RBF kernel parameter
newsgroups are used to generate the negative examples. The σ are chosen from {2−7 , 2−6 , . . . , 27 } and Cp is equal to 2Cn .
negative examples are denoted as R_comp, R_rec and R_sci. For In PUFC method, the parameter ϵ is chosen from {0, 0.05, 0.1,
Reuters-21578 dataset, we divide it into three groups: ‘‘Places’’, 0.15, . . . , 0.4, 0.45}. In LMTTL method, λ is chosen from {10−2 ,
‘‘People’’ and ‘‘Orgs’’. The same as the previous operation, each 10−1 , 100 , 101 , 102 }, C is chosen from {0.1, 0.5, 1.5, 10, 50, 100}
group takes turns as an positive examples and the remaining and the kernel function selects the linear function. In TrAdaB
three groups generate the negative examples. Then we have method, the base classifier is chosen using SVM and the number
negative examples: R_Places, R_People and R_Orgs. For MNIST of base classifiers is chosen from {1, 2, 3, . . . , 100}. In MADA
dataset, we first scale the size of each picture to 16 × 16 pixels, method, the parameter is fixed λ to 1. In MDD method, the
and then we generate three settings by randomly sampling 2000 asymptotic value of coefficient η is fixed to 0.1 and γ is chosen
instances in two different distributions respectively. We intend to from {2, 3, 4}. In our proposed method, the parameter Csl , Ctl are
use class number 0, 1 and 2 as positive examples. And we denote chosen from {1−3 , 10−2 , . . . , 103 }. The regularization parameters
them as C1 , C2 and C3 respectively. The notation R1 , R2 and R3 λsl , λtl are chosen from {10−4 , 10−3 , . . . , 102 , 103 }.
mean the remaining classes in the dataset apart from the corre-
sponding Ci class, i = 0, 1, 2. In addition, it is noteworthy that
4.4. Experimental results
we select the training examples randomly and we also keep the
number of examples of each class in balance. For USPS dataset,
In this section, we evaluate AdaTLPU method with other base-
we operate it same as the MNIST dataset. For Office-31 dataset,
lines. In order to avoid sampling error, we use five-fold cross-
we create three sub-datasets for experiment. For each dataset
validation and calculate the average performance for each dataset.
two classes are chosen, one as positive and other as negative. We
In Table 2, we show the performance and the standard deviation
then choose ‘‘bike’’ class as the positive class and ‘‘phone’’ class
of the compared methods.
In Table 2, the first column denotes the sub-dataset, the sec-
4 https://cs.nyu.edu/~roweis/data.html. ond to last column is the performance of each compared baseline.
7
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

Table 2
The performance of different competitive algorithms in different setting with r = 0.01. The best results are highlighted in bold.
Dataset GLLC PUFC TrAdaB LMTTL MADA MDD AdaTLPU
sub-dataset1 0.678 ± 0.045 0.701 ± 0.037 0.762 ± 0.012 0.718 ± 0.014 0.758 ± 0.130 0.768 ± 0.085 0.771 ± 0.018
sub-dataset2 0.663 ± 0.061 0.657 ± 0.043 0.739 ± 0.010 0.722 ± 0.004 0.701 ± 0.065 0.742 ± 0.034 0.755 ± 0.009
sub-dataset3 0.658 ± 0.057 0.661 ± 0.038 0.751 ± 0.015 0.738 ± 0.025 0.743 ± 0.022 0.769 ± 0.018 0.785 ± 0.013
sub-dataset4 0.695 ± 0.040 0.626 ± 0.032 0.714 ± 0.013 0.637 ± 0.005 0.733 ± 0.025 0.752 ± 0.182 0.760 ± 0.002
sub-dataset5 0.713 ± 0.032 0.644 ± 0.012 0.782 ± 0.015 0.718 ± 0.012 0.752 ± 0.031 0.778 ± 0.019 0.793 ± 0.017
sub-dataset6 0.682 ± 0.014 0.663 ± 0.031 0.321 ± 0.014 0.317 ± 0.008 0.706 ± 0.031 0.726 ± 0.028 0.768 ± 0.016
sub-dataset7 0.631 ± 0.012 0.684 ± 0.009 0.669 ± 0.015 0.641 ± 0.015 0.689 ± 0.078 0.701 ± 0.065 0.704 ± 0.014
sub-dataset8 0.511 ± 0.010 0.523 ± 0.013 0.546 ± 0.014 0.521 ± 0.008 0.571 ± 0.120 0.590 ± 0.051 0.608 ± 0.016
sub-dataset9 0.631 ± 0.021 0.684 ± 0.025 0.659 ± 0.035 0.638 ± 0.025 0.669 ± 0.058 0.689 ± 0.095 0.691 ± 0.011
sub-dataset10 0.523 ± 0.014 0.526 ± 0.023 0.560 ± 0.017 0.532 ± 0.018 0.581 ± 0.087 0.601 ± 0.105 0.593 ± 0.005
sub-dataset11 0.631 ± 0.022 0.684 ± 0.019 0.671 ± 0.009 0.648 ± 0.021 0.681 ± 0.121 0.705 ± 0.059 0.710 ± 0.013
sub-dataset12 0.501 ± 0.004 0.513 ± 0.007 0.552 ± 0.013 0.525 ± 0.019 0.563 ± 0.025 0.585 ± 0.081 0.589 ± 0.006
sub-dataset13 0.638 ± 0.011 0.659 ± 0.020 0.718 ± 0.007 0.702 ± 0.032 0.741 ± 0.125 0.759 ± 0.105 0.768 ± 0.015
sub-dataset14 0.513 ± 0.016 0.514 ± 0.024 0.548 ± 0.025 0.536 ± 0.045 0.560 ± 0.131 0.582 ± 0.200 0.595 ± 0.009
sub-dataset15 0.519 ± 0.003 0.525 ± 0.027 0.539 ± 0.014 0.530 ± 0.057 0.545 ± 0.125 0.559 ± 0.098 0.572 ± 0.011
sub-dataset16 0.631 ± 0.009 0.623 ± 0.017 0.658 ± 0.037 0.650 ± 0.046 0.680 ± 0.085 0.713 ± 0.023 0.717 ± 0.018
sub-dataset17 0.742 ± 0.019 0.759 ± 0.022 0.795 ± 0.028 0.781 ± 0.027 0.831 ± 0.103 0.826 ± 0.035 0.832 ± 0.006
sub-dataset18 0.556 ± 0.028 0.579 ± 0.030 0.619 ± 0.026 0.606 ± 0.014 0.639 ± 0.128 0.647 ± 0.058 0.662 ± 0.009

Table 3
The result of Wilcoxon Sign-Rank test with p-value and AdaTLPU differs highly
significant (p < 0.05) to those methods highlighted in bold.
Setting GLLC PUFC TrAdaB LMTTL MADA MDD
p-value 0.0002 0.0002 0.0002 0.0002 0.0002 0.0008

From Table 2, we discover that AdaTLPU method achieves the


best performance compared with the other baselines. First of all,
we discover that AdaTLPU performs better than GLCC and PUFC.
The reason is that the proposed method is based on transfer
learning approach in which the knowledge in the source domain
is transferred to the target domain and make the target domain
can build more accurate classifier. On the other hand, GLCC and
PUFC is a kind of single task learning method, which only learn a
classifier without the knowledge transferring process. Secondly,
we can notice that AdaTLPU yields higher accuracy than LMTTL
and TrAdaB. The main reason is that AdaTLPU method can make
better use of ambiguous examples based on similarity to con- Fig. 1. The error rate curves during iteration on the data sets.
struct PU learning classifier. With the similarity of ambiguous
examples, the decision boundary of the classifier can be refined,
which can improve the performance as well. Moreover, we can sub-dataset5 (People v s. R_People) sub-dataset7 (MNIST vs. USPS)
find that the accuracy of our method is slightly higher than MDD and sub-dataset13 (Amazon vs. DSLR) as examples. With r value
and MADA. The possible reason is that deep learning method gen- increasing from 0.01 to 0.5, we obtain the experiment results
erally has the phenomenon of overfitting, and the well training in Fig. 2. From the figures, we notice that as r increases, the
procedure can leads to the model over training, which potentially performance increases rapidly first, then the growth rate gradu-
reduces the performance of the test data. In addition, we intro-
ally slows down. In all, the performance of the AdaTLPU method
duce AdaBoost method into AdaTLPU, so that it can ensemble the
always increases with r value. This is because the target task
weak classifiers and achieve a better performance classifier for
data carry more information which can help the target task to
prediction. In conclusion, AdaTLPU method performs better than
build a better classifier. Moreover, with the change of r, AdaTLPU
each baseline method.
performs better than other baselines in most cases. This shows
In order to make further comparison of each method, we con-
that AdaTLPU method transfer knowledge very well in each
duct a statistical analysis of experiment results. Here, we adopt
dataset.
Wilcoxon Sign-Rank test [44,45] to compare different methods
on multiple datasets. Generally, as long as p-value is smaller than
0.05 (p < 0.05), we could consider that AdaTLPU method has a 4.6. Iteration analysis
significant difference from other baselines. We list the Wilcoxon
test value of AdaTLPU and each baseline over the datasets in In this section, we study convergence of AdaTLPU by the curve
Table 3. From Table 3, we can notice that the Wilcoxon test of error rate variation. We choose sub-dataset1 , sub-dataset2
results between AdaTLPU method and each baseline is smaller sub-dataset4 , sub-dataset5 , sub-dataset7 , sub-dataset8 ,
than 0.05. This indicates that AdaTLPU method is better than all sub-dataset13 and sub-dataset14 as examples and obtain the con-
the baselines from statistical perspective. vergence curves in Fig. 1. The horizontal axis denotes the number
of iterations, and vertical axis is the error rate of AdaTLPU. We
4.5. Accuracy analysis with different r discover that the eight curves on different datasets drop rapidly
in the first 10 iterations and then gradually converge to a stable
In this section, we explore that the performance varies with in- position. It can be seen that AdaTLPU always converges at or
creasing r values. We select sub-dataset2 (rec v s. R_rec), close to the stable performance points, where the rates of the
8
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

Fig. 2. The performance curves on several sub-datasets for five classifiers AdaTLPU, MDD, MADA, LMTTL, TrAdaB, PUFC, GLLC.

convergence are very fast. Moreover, AdaTLPU converges in less Acknowledgements


than 20 iterations on each sub-dataset. As a result, AdaTLPU is
efficient for all the sub-datasets. The authors would like to thank the reviewers for their very
useful comments and suggestions. This work was supported in
5. Conclusions and future work part by the Natural Science Foundation of China under Grant
62076074 and Grant 61876044, in part by Guangdong Basic and
In this paper, we propose an AdaBoost-based transfer learning Applied Basic Research Foundation, China Grant
method for solving PU learning problem. In our proposed method, 2020A1515010670 and 2020A1515011501, in part by the Science
by sharing SVM parameters and regularization terms, the source and Technology Planning Project of Guangzhou, China under
task knowledge is transferred to the target task. A new formula Grant 202002030141.
for calculating the classification error of the PU learning classifier
is presented to introduce the AdaBoost method. Further, we give Appendix. Proof of dual form
the iterative optimization method and the proof of the error
bound for AdaBoost method. At last, we organize experiment and
show that AdaTLPU method achieves the best performance in the Proof. By introducing Lagrangian multipliers αils+ ≥ 0, αjls− ≥ 0,
dataset.
αklt + ≥ 0, αhlt − ≥ 0, βils+ ≥ 0, βjls− ≥ 0, βklt + ≥ 0 and βhlt − ≥ 0, we
In the future, we plan to exploit an online learning algorithm
convert the objective function (13) into the Lagrangian function.
to learn the AdaTLPU classifier in the streaming data environ-
We then have the following into the Lagrangian function:
ment.
1 λsl λtl
L(ω0 , v, b, ξ , α, β ) = ∥ω0l ∥2 + ∥νsl ∥2 + ∥νtl ∥2
Declaration of competing interest 2 ∑ 2 ∑2
+ Csl m+
s (x i ) ξil + C sl m−s (xj )ξjl

The authors declare that they have no known competing finan- Ss+ Ss−
∑ ∑
cial interests or personal relationships that could have appeared + Ctl mt (xk )ξkl + Ctl
+
t (xh )ξhl
m−
to influence the work reported in this paper. St + St −

9
B. Liu, C. Liu, Y. Xiao et al. Knowledge-Based Systems 241 (2022) 108162

αils+ (ω0l + vsl )T · φ (xi ) + b1l + ξil − 1 0 ≤ αkl
t+
t (xk ), 0 ≤ αhl ≤ Ctl mt (xh ).
≤ Ctl m+ t− −
[ ]

Ss+
∑ Through the above work, the dual form is proved. □
αjls− − (ω0l + vsl )T · φ (xj )
[

Ss− References
− b1l + ξjl − 1
]
∑ [1] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S. Yu, Text
αklt + (ω0l + vtl )T · φ (xk ) + b2l + ξkl − 1
[ ]
− classification without negative examples revisit, IEEE Trans. Knowl. Data
Eng. 18 (1) (2006) 6–20.
St +
∑ [2] Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li, Partially supervised classifi-
αhlt − − (ω0l + vtl )T · φ (xh ) cation of text documents, in: Claude Sammut, Achim G. Hoffmann (Eds.),
[

Machine Learning, Proceedings of the Nineteenth International Conference
St − (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12,
2002, Morgan Kaufmann, 2002, pp. 387–394.
−b2l + ξhl − 1
]
∑ ∑ [3] Xiaoli Li, Bing Liu, Learning to classify texts using positive and unla-
− βils+ ξil − βjls− ξjl beled data, in: Georg Gottlob, Toby Walsh (Eds.), IJCAI-03, Proceedings
of the Eighteenth International Joint Conference on Artificial Intelli-
Ss+ Ss− gence, Acapulco, Mexico, August 9-15, 2003, Morgan Kaufmann, 2003, pp.
∑ ∑ 587–594.
− β ξ −
t+
kl kl βhlt − ξhl . (A.1) [4] Xiaoli Li, Philip S. Yu, Bing Liu, See-Kiong Ng, Positive unlabeled learning
St + St − for data stream classification, in: Proceedings of the SIAM International
Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks,
Differentiate the Lagrangian function (A.1) with ω0l , vsl , vtl , b1l , b2l , Nevada, USA, SIAM, 2009, pp. 259–270.
ξil , ξjl , ξkl and ξhl . The following equations are obtained: [5] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, Philip S. Yu, Building text
∑ ∑ classifiers using positive and unlabeled examples, in: Proceedings of the
∇ω0l L = ω0l − αils+ φ (xi ) + αjls− φ (xj ) 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22
December 2003, Melbourne, Florida, USA, IEEE Computer Society, 2003,
Ss+ Ss−
pp. 179–188.
∑ ∑
− α φ (xk ) +
t+
kl αhlt − φ (xh ) = 0, (A.2) [6] Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang, PEBL: Web page clas-
sification without negative examples, IEEE Trans. Knowl. Data Eng. 16 (1)
St + St − (2004) 70–81.
∑ ∑ [7] Hong Shi, Shaojun Pan, Jian Yang, Chen Gong, Positive and unlabeled
∇vsl L = λs vsl − αils+ φ (xi ) + αjls− φ (xj ) = 0, (A.3) learning via loss decomposition and centroid estimation, in: Jérôme Lang
Ss+ Ss− (Ed.), Proceedings of the Twenty-Seventh International Joint Conference
∑ ∑ on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden,
∇vtl L = λt vtl − αklt + φ (xk ) + αhlt − φ (xh ) = 0, (A.4) ijcai.org, 2018, pp. 2689–2695.
St + St − [8] Chuang Zhang, Dexin Ren, Tongliang Liu, Jian Yang, Chen Gong, Positive
∑ ∑ and unlabeled learning with label disambiguation, in: Sarit Kraus (Ed.),
\nabla_{b_{1l}} L = -\sum_{S_{s+}} \alpha_{il}^{s+} + \sum_{S_{s-}} \alpha_{jl}^{s-} = 0, \qquad (A.5)

\nabla_{b_{2l}} L = -\sum_{S_{t+}} \alpha_{kl}^{t+} + \sum_{S_{t-}} \alpha_{hl}^{t-} = 0, \qquad (A.6)

\nabla_{\xi_{il}} L = C_s m_s^{+}(x_i) - (\alpha_{il}^{s+} + \mu_i^{s+}) = 0, \qquad (A.7)

\nabla_{\xi_{jl}} L = C_s m_s^{-}(x_j) - (\alpha_{jl}^{s-} + \mu_j^{s-}) = 0, \qquad (A.8)

\nabla_{\xi_{kl}} L = C_t m_t^{+}(x_k) - (\alpha_{kl}^{t+} + \mu_k^{t+}) = 0, \qquad (A.9)

\nabla_{\xi_{hl}} L = C_t m_t^{-}(x_h) - (\alpha_{hl}^{t-} + \mu_h^{t-}) = 0. \qquad (A.10)

Eqs. (A.11)–(A.13) can be obtained from (A.2)–(A.4):

v_{sl} = \frac{1}{\lambda_s}\left(\sum_{S_{s+}} \alpha_{il}^{s+} \phi(x_i) + \sum_{S_{s-}} \alpha_{jl}^{s-} \phi(x_j)\right), \qquad (A.11)

v_{tl} = \frac{1}{\lambda_t}\left(\sum_{S_{t+}} \alpha_{kl}^{t+} \phi(x_k) + \sum_{S_{t-}} \alpha_{hl}^{t-} \phi(x_h)\right), \qquad (A.12)

\omega_{0l} = \lambda_s v_{sl} + \lambda_t v_{tl}. \qquad (A.13)

Substituting (A.5)–(A.10) into the Lagrangian function (A.1) yields (A.14):

L(\alpha, \mu) = \frac{1}{2}\|\omega_{0l}\|^2 + \frac{\lambda_s}{2}\|\nu_s\|^2 + \frac{\lambda_t}{2}\|\nu_t\|^2
- \sum_{S_{s+}} \alpha_{il}^{s+} (\omega_{0l} + v_{sl})^{T} \phi(x_i) + \sum_{S_{s-}} \alpha_{jl}^{s-} (\omega_{0l} + v_{sl})^{T} \phi(x_j)
- \sum_{S_{t+}} \alpha_{kl}^{t+} (\omega_{0l} + v_{tl})^{T} \phi(x_k) + \sum_{S_{t-}} \alpha_{hl}^{t-} (\omega_{0l} + v_{tl})^{T} \phi(x_h)
+ \sum_{S_{s+}} \alpha_{il}^{s+} + \sum_{S_{s-}} \alpha_{jl}^{s-} + \sum_{S_{t+}} \alpha_{kl}^{t+} + \sum_{S_{t-}} \alpha_{hl}^{t-}. \qquad (A.14)

The range of \alpha_{il}^{s+}, \alpha_{jl}^{s-}, \alpha_{kl}^{t+} and \alpha_{hl}^{t-} can be derived from (A.7)–(A.10):

0 \le \alpha_{il}^{s+} \le C_{sl} m_s^{+}(x_i), \quad 0 \le \alpha_{jl}^{s-} \le C_{sl} m_s^{-}(x_j), \quad 0 \le \alpha_{kl}^{t+} \le C_{tl} m_t^{+}(x_k), \quad 0 \le \alpha_{hl}^{t-} \le C_{tl} m_t^{-}(x_h).
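As a computational aside, the dual of the form (A.14), maximised under the equality constraints (A.5)–(A.6) and the box constraints above, is a concave quadratic programme in \alpha that an off-the-shelf constrained optimiser can handle for each weak classifier. The sketch below is only an illustration of that step and is not the authors' implementation: the toy data, the RBF kernel, the membership weights m(x) fixed to one, the helper names (rbf, neg_dual) and the matrix Q standing in for the exact quadratic form of (A.14) are all assumptions made for the example; only the constraint structure mirrors the appendix.

# Illustrative sketch only: solving a dual of the same shape as (A.14),
# i.e. a concave quadratic in alpha, maximised under the equality
# constraints (A.5)-(A.6) and the box constraints derived from (A.7)-(A.10).
# The toy data and the matrix Q are stand-ins, not the paper's exact Hessian.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# toy source/target examples playing the roles of S_{s+}, S_{s-}, S_{t+}, S_{t-}
Xs_pos, Xs_neg = rng.normal(1, 1, (5, 2)), rng.normal(-1, 1, (4, 2))
Xt_pos, Xt_neg = rng.normal(1, 1, (3, 2)), rng.normal(-1, 1, (3, 2))
X = np.vstack([Xs_pos, Xs_neg, Xt_pos, Xt_neg])
n_sp, n_sn, n_tp, n_tn = len(Xs_pos), len(Xs_neg), len(Xt_pos), len(Xt_neg)
n = len(X)

def rbf(A, B, gamma=0.5):
    # RBF kernel matrix standing in for phi(x)^T phi(x')
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)
Q = K + 1e-6 * np.eye(n)          # stand-in PSD matrix for the quadratic part of (A.14)

Cs, Ct = 1.0, 1.0
m = np.ones(n)                    # membership weights m(x); all set to 1 for simplicity
upper = np.concatenate([Cs * m[:n_sp + n_sn], Ct * m[n_sp + n_sn:]])

def neg_dual(alpha):
    # minimise the negative of the dual, i.e. maximise sum(alpha) - 0.5 alpha^T Q alpha
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

# equality constraints mirroring (A.5) and (A.6):
# sum over S_{s+} equals sum over S_{s-}, and likewise for the target task
cons = [
    {"type": "eq",
     "fun": lambda a: a[:n_sp].sum() - a[n_sp:n_sp + n_sn].sum()},
    {"type": "eq",
     "fun": lambda a: a[n_sp + n_sn:n_sp + n_sn + n_tp].sum()
                      - a[n_sp + n_sn + n_tp:].sum()},
]
res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, u) for u in upper], constraints=cons)
alpha = res.x
print("converged:", res.success, " nonzero alphas:", int(np.sum(alpha > 1e-6)))

SLSQP is used here simply because it accepts box bounds and equality constraints directly; at realistic problem sizes a dedicated QP or SMO-style solver would be the natural choice.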
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 4250–4256.

[9] Charles Elkan, Keith Noto, Learning classifiers from only positive and unlabeled data, in: Ying Li, Bing Liu, Sunita Sarawagi (Eds.), Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, ACM, 2008, pp. 213–220.

[10] Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama, Analysis of learning from positive and unlabeled data, in: Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, Kilian Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 703–711.

[11] Noah Youngs, Dennis E. Shasha, Richard Bonneau, Positive-unlabeled learning in the face of labeling bias, in: IEEE International Conference on Data Mining Workshop, ICDMW 2015, Atlantic City, NJ, USA, November 14-17, 2015, IEEE Computer Society, 2015, pp. 639–645.

[12] Shuangxun Ma, Ruisheng Zhang, PU-LP: A novel approach for positive and unlabeled learning by label propagation, in: 2017 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops, Hong Kong, China, July 10-14, 2017, IEEE Computer Society, 2017, pp. 537–542.

[13] Jing Gao, Wei Fan, Jing Jiang, Jiawei Han, Knowledge transfer via multiple model local structure mapping, in: Ying Li, Bing Liu, Sunita Sarawagi (Eds.), Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, ACM, 2008, pp. 283–291.

[14] Hiba Chougrad, Hamid Zouaki, Omar Alheyane, Multi-label transfer learning for the early diagnosis of breast cancer, Neurocomputing 392 (2020) 168–180.

[15] Xin Zheng, Luyue Lin, Bo Liu, Yanshan Xiao, Xiaoming Xiong, A multi-task transfer learning method with dictionary learning, Knowl. Based Syst. 191 (2020) 105233.

[16] Sotiris B. Kotsiantis, Bagging and boosting variants for handling classifications problems: a survey, Knowl. Eng. Rev. 29 (1) (2014) 78–100.

[17] Maxime Latulippe, Alexandre Drouin, Philippe Giguère, François Laviolette, Accelerated robust point cloud registration in natural environments through positive and unlabeled learning, in: Francesca Rossi (Ed.), IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, IJCAI/AAAI, 2013, pp. 2480–2487.

[18] Slim Kanoun, Adel M. Alimi, Yves Lecourtier, Natural language morphology integration in off-line arabic optical text recognition, IEEE Trans. Syst. Man Cybern. B 41 (2) (2011) 579–590.

[19] Wee Sun Lee, Bing Liu, Learning with positive and unlabeled examples using weighted logistic regression, in: Tom Fawcett, Nina Mishra (Eds.), Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, AAAI Press, 2003, pp. 448–455.

[20] Ting Ke, Ling Jing, Hui Lv, Lidong Zhang, Yaping Hu, Global and local learning from positive and unlabeled examples, Appl. Intell. 48 (8) (2018) 2373–2392.

[21] Sundararajan Sellamanickam, Priyanka Garg, Sathiya Keerthi Selvaraj, A pairwise ranking based approach to learning with positive and unlabeled examples, in: Craig Macdonald, Iadh Ounis, Ian Ruthven (Eds.), Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, ACM, 2011, pp. 663–672.

[22] Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama, Convex formulation for learning from positive and unlabeled data, in: Francis R. Bach, David M. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, in: JMLR Workshop and Conference Proceedings, vol. 37, JMLR.org, 2015, pp. 1386–1394.

[23] Sinno Jialin Pan, Qiang Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.

[24] Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S. Yu, Ruixin Zhu, Transfer learning on heterogenous feature spaces via spectral transformation, in: Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, Xindong Wu (Eds.), ICDM 2010, the 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, IEEE Computer Society, 2010, pp. 1049–1054.

[25] Hua-Yan Wang, Qiang Yang, Transfer learning by structural analogy, in: Wolfram Burgard, Dan Roth (Eds.), Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, AAAI Press, 2011.

[26] Jing Jiang, ChengXiang Zhai, Instance weighting for domain adaptation in NLP, in: John A. Carroll, Antal van den Bosch, Annie Zaenen (Eds.), ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, The Association for Computational Linguistics, 2007.

[27] Xuejun Liao, Ya Xue, Lawrence Carin, Logistic regression with an auxiliary data source, in: Luc De Raedt, Stefan Wrobel (Eds.), Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, in: ACM International Conference Proceeding Series, vol. 119, ACM, 2005, pp. 505–512.

[28] Pengcheng Wu, Thomas G. Dietterich, Improving SVM accuracy by training on auxiliary data sources, in: Carla E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.

[29] Tony Jebara, Multi-task feature and kernel selection for SVMs, in: Carla E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.

[30] Su-In Lee, Vassil Chatalbashev, David Vickrey, Daphne Koller, Learning a meta-level prior for feature relevance from multiple related tasks, in: Zoubin Ghahramani (Ed.), Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, in: ACM International Conference Proceeding Series, vol. 227, ACM, 2007, pp. 489–496.

[31] Theodoros Evgeniou, Massimiliano Pontil, Regularized multi-task learning, in: Won Kim, Ron Kohavi, Johannes Gehrke, William DuMouchel (Eds.), Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, ACM, 2004, pp. 109–117.

[32] Lilyana Mihalkova, Tuyen N. Huynh, Raymond J. Mooney, Mapping and revising Markov logic networks for transfer learning, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 608–614.

[33] Matthew Richardson, Pedro M. Domingos, Markov logic networks, Mach. Learn. 62 (1–2) (2006) 107–136.

[34] Chris Buckley, Gerard Salton, James Allan, The effect of adding relevance information in a relevance feedback environment, in: W. Bruce Croft, C.J. van Rijsbergen (Eds.), Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), ACM/Springer, 1994, pp. 292–300.

[35] Vladimir Vapnik, The Nature of Statistical Learning Theory, Springer, 2013.

[36] Yoav Freund, Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. System Sci. 55 (1) (1997) 119–139.

[37] TingTing Li, WeiYa Fan, YunSong Luo, A method on selecting reliable samples based on fuzziness in positive and unlabeled learning, CoRR abs/1903.11064, 2019.

[38] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, Yong Yu, Boosting for transfer learning, in: Zoubin Ghahramani (Ed.), Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, in: ACM International Conference Proceeding Series, vol. 227, ACM, 2007, pp. 193–200.

[39] Brian Quanz, Jun Huan, Large margin transductive transfer learning, in: David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, Jimmy J. Lin (Eds.), Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, ACM, 2009, pp. 1327–1336.

[40] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, Jianmin Wang, Multi-adversarial domain adaptation, in: Sheila A. McIlraith, Kilian Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 3934–3941.

[41] Yuchen Zhang, Tianle Liu, Mingsheng Long, Michael I. Jordan, Bridging theory and algorithm for domain adaptation, in: Kamalika Chaudhuri, Ruslan Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, in: Proceedings of Machine Learning Research, vol. 97, PMLR, 2019, pp. 7404–7413.

[42] Piotr Koniusz, Yusuf Tas, Fatih Porikli, Domain adaptation by mixture of alignments of second- or higher-order scatter tensors, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 7139–7148.

[43] Yabin Zhang, Hui Tang, Kui Jia, Mingkui Tan, Domain-symmetric networks for adversarial domain adaptation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 5031–5040.

[44] Ben Derrick, Paul White, Comparing two samples from an individual Likert question, Int. J. Math. Stat. 18 (2017).

[45] John W. Pratt, Remarks on zeros and ties in the Wilcoxon signed rank procedures, J. Amer. Statist. Assoc. 54 (287) (1959) 655–667.

Bo Liu is a professor with the Faculty of Automation, Guangdong University of Technology. His research interests include support vector machines, feature extraction, and clustering. He has published papers in IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), and the ACM International Conference on Information and Knowledge Management (CIKM).

Changdong Liu is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include positive and unlabelled learning and boosting methods.

Yanshan Xiao received the Ph.D. degree in computer science from the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, in 2011. She is with the Faculty of Computers, Guangdong University of Technology. Her research interests include data mining and machine learning. She has published papers in IEEE PAMI, IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Knowledge and Information Systems, and the International Joint Conferences on Artificial Intelligence (IJCAI).

Laiwang Liu is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.
Weibin Li is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.

Xiaodong Chen is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.