AdaBoost based transfer learning method for positive and unlabelled learning
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
ARTICLE INFO

Article history:
Received 23 January 2021
Received in revised form 28 June 2021
Accepted 3 January 2022
Available online 20 January 2022

Keywords:
Transfer learning
PU learning
AdaBoost method

ABSTRACT

Positive and unlabelled learning (PU learning) is the problem of training a classifier using only labelled positive examples and unlabelled examples. Recently, PU learning has been widely studied and applied in a number of areas. In this paper, we present an AdaBoost-based transfer learning method for the PU learning problem, called AdaTLPU for short. In the proposed model, the source-task knowledge is transferred to the target task by sharing SVM parameters and regularization terms. At the same time, the similarity of the ambiguous examples to the positive and negative classes is taken into account to refine the decision boundary of the classifier. We then adopt the AdaBoost method to ensemble the obtained weak classifiers into a strong classifier for prediction. In addition, we put forward an iterative optimization method to obtain the classifier and present a proof of the training error bound for the proposed method. Finally, we conduct experiments to explore the performance of AdaTLPU, and the results indicate that AdaTLPU achieves better performance than previous PU learning methods.

© 2022 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.knosys.2022.108162
early diagnosis of breast cancer and introduce transfer learning to multi-label learning based on a fine-tuning strategy. Zheng et al. [15] propose a dictionary-based transfer learning method, which utilizes the input examples to obtain a sparse representation and then trains the classifier with the obtained sparse representation. In addition, boosting has been proposed to construct a strong classifier by ensembling several weak classifiers [16]. This can be beneficial when only a small amount of training data is available. Moreover, boosting can reduce the influence of any individual classifier and integrate the weak classifiers into a robust classifier. Considering that transfer learning has been widely studied, it is essential to consider the problem of transfer learning-based PU learning, in which boosting is used to ensemble a strong classifier for prediction.

In this paper, we solve the problem of transfer learning with positive and unlabelled data. To build the learning model, we face two main challenges. Firstly, when positive and unlabelled data from the source and target domains are available, how do we use them to guide the model to learn a classifier for prediction? Secondly, how do we construct a strong transfer learning-based classifier by integrating the weak transfer learning classifiers learned from positive and unlabelled data? To resolve these two challenges, we propose a novel approach, called the AdaBoost-based transfer learning method for the PU learning problem (AdaTLPU). We build the transfer learning model on parameters shared in the SVM, and regularization terms are used to transfer knowledge from the source task. In the proposed method, we first identify reliable negative examples from the unlabelled data for the target task and the source task respectively, and then calculate similarity values for the ambiguous examples among the remaining unlabelled data. We then build the transfer learning-based PU model, which incorporates the positive and reliable negative examples, together with the ambiguous examples weighted by their similarity values, into the learning. With the derived classification errors, we adopt the AdaBoost method to ensemble the weak classifiers into a strong classifier.

In all, the novelty of this paper is that, to the best of our knowledge, it is the first time that transfer learning and AdaBoost are incorporated into positive and unlabelled learning. At the same time, we incorporate the similarity values of the ambiguous examples contained in the unlabelled set into the transfer learning-based classifier, so that the ambiguous examples contribute to the classifier construction according to their similarity values. As a result, we can obtain a superior classifier for prediction. Further, solving the proposed model is itself a challenge; we utilize the quadratic programming method to solve the AdaTLPU model efficiently and analyse the convergence of the proposed method based on the derived training errors.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 proposes the AdaTLPU method and derives an optimization algorithm to solve the presented model. Section 4 shows the experiments and results. Section 5 concludes the paper.

2. Related work

With the rapid development of information technology, humans produce massive amounts of data all the time. In some areas, such as information retrieval [9,17] and text classification [1,18], people often have some positive data and unlabelled data, while labelling data manually is expensive and laborious. For this reason, positive and unlabelled learning has drawn more and more attention in recent years. Positive and unlabelled learning is a kind of binary classification problem that utilizes only positive examples and unlabelled examples to solve the classification task. A great deal of research has been done on PU learning, and the previous methods can be roughly divided into the following three categories.

The first category is the two-step strategy [2,3,5,19], which first uses the unlabelled set to extract positive or negative examples and then trains a classifier with the reliable negative examples and the original positive examples. For example, Liu et al. [2] propose a method called S-EM, which builds an initial classifier to select those documents that are most likely to be negative documents from the mixed dataset. After that, they utilize the EM algorithm to obtain a good local maximum of the likelihood function. Li et al. [3] use the Rocchio method to extract reliable negative examples and the SVM technique for classifier building.

The second category is the biased PU learning method [5,7,20], which treats the unlabelled examples as negative examples with class label noise. These methods place higher penalties on misclassified positive examples in order to overcome the class label noise. Ke et al. [20] put forward a PU learning classifier by combining a biased least squares SVM (BLSSVM) and a smooth regularization term. Sellamanickam et al. [21] propose a new SVM-based approach, which includes a margin-based dual loss function. In this approach, the positive and negative class examples receive penalties, and the method uses an adjustment formula to set the threshold and the regularization parameter in the objective function. However, in this kind of method it is challenging to set proper penalties for the label noise; if the penalties are not selected properly, the performance of the method degrades.

The last category suggests that PU classification can be cast as cost-sensitive learning, which regards the unlabelled examples as negative and reweights the penalty of each class [9-11]. For example, du Plessis et al. [10] first show that the PU learning problem can be solved by cost-sensitive learning and find that using non-convex loss functions can avoid a wrong classification boundary. The work in [22] introduces a more general risk estimator to overcome the non-convexity, which uses an ordinary convex loss function for the unlabelled examples and a composite loss function for the positive examples. Although these methods work well in practice, their performance depends largely on the reliability of the loss function for the PU data. In fact, these methods are unreliable in general cases, which limits their applications in practice. In contrast, our proposed method is model-based, and its objective function is theoretically grounded in the principles of structural risk minimization and empirical risk minimization.

For the positive and unlabelled learning problem, previous work always utilizes the knowledge of a single domain to build the classifier for prediction. However, in many positive and unlabelled learning settings, the model fails to achieve good performance due to the small number of examples. Thus, in order to improve the performance of the model, we can use transfer learning to transfer knowledge from a domain with sufficient examples to a target domain with insufficient examples, so that the classifier built on the target domain has superior prediction results.

In recent years, transfer learning [23-25] has become an important topic in data mining and machine learning. Different from traditional machine learning for a single task, transfer learning makes use of source domain data and target domain data to train the model. The previous transfer learning methods can be roughly classified into the following four categories.

The first approach is instance transfer [26-28], which is based on the precondition that part of the source-task data is similar to the target-task data.
In practice, it often re-weights some source domain data and then utilizes them in the target domain. Jiang et al. [26] propose a new method based on conditional probability, which eliminates the incorrect training examples from the source task data. Liao et al. [27] propose an active learning method, which utilizes the source task data to choose unlabelled data in the target task. Wu et al. [28] assume that the source domain data come from different distributions and present a new SVM-based method to improve the accuracy of the classifier. This kind of approach transfers knowledge at the instance level, which requires measuring similarity and assigning a weight to every instance used to transfer knowledge; this makes it less efficient than parameter-based transfer learning.

The second approach is feature-representation transfer [29,30], which needs to obtain an appropriate feature representation that makes the divergence between the source and target tasks as small as possible. According to whether the source task data are labelled or not, these methods can be divided into supervised and unsupervised feature construction. Jebara [29] proposes an SVM-based multi-task learning method, which is used in multiple classification tasks under different labelled-dataset conditions. In addition, Lee et al. [30] propose a multi-task learning method that learns feature weights and meta-priors from each task simultaneously, where each task utilizes the meta-priors to transfer knowledge. This kind of method depends heavily on feature representation and feature selection; once there is no suitable feature through which to transfer knowledge, it does not obtain desirable performance.

The third approach is parameter transfer [13,31], which shares prior distributions or some hyperparameters of the model. In the work [31], Evgeniou et al. propose a transfer learning method for SVM, which transfers the knowledge of the source task to the target task via common parameters. Gao et al. [13] propose an ensemble transfer learning framework, in which the weights depend on the predictive ability of each model.

The last approach is relational-knowledge transfer [32,33], which does not assume that the data drawn from each domain are independently and identically distributed. This kind of method builds a map of relational knowledge between the source task and the target task. For example, Mihalkova et al. [32] propose an approach based on Markov Logic Networks (MLNs), which can transfer relational knowledge across relational domains. Richardson et al. [33] present a method to integrate First-Order Logic (FOL) and Probabilistic Graphical Models (PGMs) in a unified representation for statistical relational learning. Although this kind of approach does not require the data to be independently and identically distributed, it needs to build a relational map between the source task and the target task, which limits its applicability.

Although transfer learning has received much attention, most transfer learning methods aim at learning with definite labels. They do not explicitly solve the problem of transfer learning with positive and unlabelled data, and they do not transfer knowledge from a source task to a target task where both tasks contain unlabelled examples. In this paper, we propose a transfer learning-based framework to address the positive and unlabelled learning problem, and we build the predictive classifier for the target domain by transferring knowledge from the source domain.

3. The proposed method

In this section, we describe the proposed AdaTLPU method. We first extract the reliable negative examples in Section 3.1, generate the similarity weights in Section 3.2, and present the transfer learning-based PU learning method in Section 3.3. In Section 3.4, we solve the presented SVM-based problem and obtain the dual problem. With the dual problem, we ensemble the base classifiers by the AdaBoost method to form a strong classifier.

3.1. Reliable negative examples extraction

We assume that S_s and S_t denote the source domain and the target domain respectively, and that each domain contains a positive and an unlabelled dataset. For the source and target domains, we have S_s = {P_{S_s}, U_{S_s}} and S_t = {P_{S_t}, U_{S_t}}. Here, P_{S_s} and P_{S_t} denote the labelled positive examples of the two domains, and U_{S_s} and U_{S_t} represent the unlabelled examples of the two domains.

We extract the reliable negative examples in the source domain and in the target domain and put them into the subsets N_{S_s} and N_{S_t} respectively. We take the source domain as an example to explain this process; the target domain is handled in the same way. Firstly, we utilize both the Rocchio technique [3] and the Spy technique [2] to obtain the most reliable negative examples, and we put them into the subsets S_{s1} and S_{s2} respectively. We believe that the examples agreed on by both techniques are the most reliable negative examples, that is, N_{S_s} = S_{s1} \cap S_{s2}. The Rocchio technique [3] first assigns the unlabelled set U to the negative class and the positive set P to the positive class. Then, using x to denote an example, we can calculate the positive prototype vector c^{+} and the negative prototype vector c^{-}:

c^{+} = \alpha \frac{1}{|P|} \sum_{x \in P} \frac{x}{\|x\|} - \beta \frac{1}{|U|} \sum_{x \in U} \frac{x}{\|x\|},    (1)

c^{-} = \alpha \frac{1}{|U|} \sum_{x \in U} \frac{x}{\|x\|} - \beta \frac{1}{|P|} \sum_{x \in P} \frac{x}{\|x\|}.    (2)

Finally, for each example x in the unlabelled set U, if Sim(c^{+}, x) \le Sim(c^{-}, x), we put the example x into the negative set N; the examples in the set N are the reliable negative examples. The Spy technique [2] first puts some labelled examples (spies) into the unlabelled set to obtain spy data. The technique then utilizes the spy data to determine a threshold t, which is used to estimate the most likely negative examples. In practice, for a specific example x, if the probability satisfies

Pr[c^{-} \mid x] < t,    (3)

we can assume that x belongs to the negative class c^{-}; we then put the example x into the negative set N, and the examples in the set N are the reliable negative examples. For more details of the Rocchio technique and the Spy technique, one can refer to [2,3].

After the reliable negative examples are obtained, we remove them from the unlabelled data subset, i.e., U_{S_s} = U_{S_s} - N_{S_s}. Similarly, we can obtain the sets N_{S_t} and U_{S_t}.
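To make the extraction step concrete, the following is a minimal sketch of the Rocchio part of this procedure for one domain, assuming dense feature vectors and cosine similarity for Sim. The alpha and beta weights are illustrative defaults from the relevance-feedback literature, since the paper does not restate them, and the Spy step and the intersection N_{S_s} = S_{s1} \cap S_{s2} are omitted; this is not the authors' implementation.

```python
import numpy as np

def rocchio_reliable_negatives(P, U, alpha=16.0, beta=4.0):
    """Rocchio step of Section 3.1 (Eqs. (1)-(2) plus the similarity test).

    P, U: 2-D arrays holding the positive and unlabelled feature vectors.
    alpha, beta: illustrative Rocchio weights (not stated in the paper).
    Returns the unlabelled examples judged closer to the negative prototype.
    """
    unit = lambda X: X / np.linalg.norm(X, axis=1, keepdims=True)
    c_pos = alpha * unit(P).mean(axis=0) - beta * unit(U).mean(axis=0)  # Eq. (1)
    c_neg = alpha * unit(U).mean(axis=0) - beta * unit(P).mean(axis=0)  # Eq. (2)
    cos = lambda c, x: c @ x / (np.linalg.norm(c) * np.linalg.norm(x) + 1e-12)
    # keep x when Sim(c+, x) <= Sim(c-, x), i.e. x is a candidate negative
    mask = np.array([cos(c_pos, x) <= cos(c_neg, x) for x in U])
    return U[mask]
```

In practice the returned set would be intersected with the output of the Spy technique before being accepted as N_{S_s}.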
3.2. Similarity weight generation

After the above step, the datasets are separated into three pairs of subsets: the positive sets P_{S_s}, P_{S_t}, the reliable negative sets N_{S_s}, N_{S_t} and the ambiguous sets U_{S_s}, U_{S_t}. To make full use of the data in U_{S_s} and U_{S_t}, we put forward a similarity-based data model and calculate the corresponding similarity values m^{+}(x) and m^{-}(x) for each ambiguous example x in the sets U_{S_s} and U_{S_t}; the similarity-based data model is presented as follows.

The similarity values are generated in the source and target domains respectively. Taking the source domain as an example, we first use K-means clustering to cluster the examples in N_{S_s} into m_s micro-clusters, denoted as N_{S_s1}, N_{S_s2}, \ldots, N_{S_s m_s}, where m_s = t\,|N_{S_s}|/(|U_{S_s}| + |N_{S_s}|) and t is set to 30.

The decision parameter of each task shares a common part \omega_0 with the other task, \omega_s = \omega_0 + v_s and \omega_t = \omega_0 + v_t, and the transfer learning-based PU model is formulated as follows:

\min_{\omega_0, \nu, b, \xi, \mu} \sum_{l=1}^{L} \Big[ \frac{\mu_l}{2}\|\omega_{0l}\|^2 + \frac{\lambda_{sl}}{2}\|\nu_{sl}\|^2 + \frac{\lambda_{tl}}{2}\|\nu_{tl}\|^2 + C_{sl}\Big(\sum_{P_{S_s} \cup U_{S_s}} m_s^{+}(x_i)\,\xi_{il}^{+} + \sum_{N_{S_s} \cup U_{S_s}} m_s^{-}(x_j)\,\xi_{jl}^{-}\Big) + C_{tl}\Big(\sum_{P_{S_t} \cup U_{S_t}} m_t^{+}(x_k)\,\xi_{kl}^{+} + \sum_{N_{S_t} \cup U_{S_t}} m_t^{-}(x_h)\,\xi_{hl}^{-}\Big) \Big].    (11)

Solving the corresponding dual problem (the derivation is given in the Appendix), the multipliers satisfy 0 \le \alpha_{il}^{s+} \le C_{sl} m_s^{+}(x_i), 0 \le \alpha_{jl}^{s-} \le C_{sl} m_s^{-}(x_j), 0 \le \alpha_{kl}^{t+} \le C_{tl} m_t^{+}(x_k) and 0 \le \alpha_{hl}^{t-} \le C_{tl} m_t^{-}(x_h), and the primal solution is recovered from

v_{tl} = \frac{1}{\lambda_{tl}} \Big( \sum_{S_{t+}} \alpha_{kl}^{t+} \varphi(x_k) - \sum_{S_{t-}} \alpha_{hl}^{t-} \varphi(x_h) \Big),    (16)

together with the analogous expression for v_{sl}, and

\omega_{0l} = \lambda_{sl} v_{sl} + \lambda_{tl} v_{tl}.    (17)
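As a reading aid, the shared-parameter structure behind Eqs. (11) and (16)-(17) can be sketched as follows. This is a minimal illustration assuming a linear feature map; make_task_classifiers and its parameter names are hypothetical, not the authors' code.

```python
import numpy as np

def make_task_classifiers(w0, v_s, v_t, b1, b2, phi=lambda x: np.asarray(x)):
    """Decision functions implied by w_s = w_0 + v_s and w_t = w_0 + v_t.

    w0 carries the knowledge shared across tasks; v_s and v_t are the
    task-specific offsets regularized by lambda_sl and lambda_tl.
    """
    f_source = lambda x: float((w0 + v_s) @ phi(x)) + b1
    f_target = lambda x: float((w0 + v_t) @ phi(x)) + b2
    return f_source, f_target

# usage: f_s, f_t = make_task_classifiers(w0, v_s, v_t, b1, b2)
# a target example is labelled +1 if f_t(x) >= 0, otherwise -1
```

Because only v_t differs between the two functions, penalizing the norms of v_s and v_t pulls both tasks towards the shared parameter w_0, which is how the source-task knowledge is transferred.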
In this section, we utilize AdaBoost to ensemble the obtained classifiers into a strong classifier, based on the classification error of the proposed method.

The AdaBoost method first initializes the weight distribution of the training data with equal weights and then trains a weak classifier. The training process is as follows: if a training example is accurately classified by the weak classifier, its weight is decreased in the next training round; on the contrary, if a training example is misclassified, its weight is increased. The training set with the updated weights is used to train the next classifier, and the entire training process proceeds iteratively in this way. Finally, the weak classifiers obtained from the training rounds are combined into a strong classifier. After each weak classifier is trained, the weight of a weak classifier with a small classification error rate is increased so that it plays a greater role in the final classification function, while the weight of a weak classifier with a large classification error rate is reduced so that it plays a smaller role. In other words, a weak classifier with a low error rate occupies a larger weight in the final classifier, and vice versa.

As is well known, the PU learning problem is a weakly supervised learning problem, and it is not easy to evaluate the classification error in an iteration; however, the AdaBoost method needs the classification error in every iteration to adjust its parameters. Thus, we need to construct a formula to compute the classification error for the PU learning problem.

Based on the similarity weights, we put forward the classification error for PU learning as follows:

e = \frac{\#\{P_S\} + \#\{N_S\} + \sum_{U_S} \gamma(x_i, f(x_i))}{|S|},    (18)

where \#\{P_S\} and \#\{N_S\} are the numbers of misclassified points in the P_S and N_S classes respectively, and |S| is the total size of the dataset. \gamma(x, f(x)) is a function defined as follows:

\gamma(x_i, f(x_i)) = \begin{cases} 1 - m^{-}(x_i), & f(x_i) = -1 \\ 1 - m^{+}(x_i), & f(x_i) = +1 \end{cases}
 = \Big(-\frac{1}{2} f(x_i) + \frac{1}{2}\Big)\big(1 - m^{-}(x_i)\big) + \Big(\frac{1}{2} f(x_i) + \frac{1}{2}\Big)\big(1 - m^{+}(x_i)\big).    (19)

The function above evaluates the classification error rate of the ambiguous examples.
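A direct transcription of Eqs. (18)-(19) into code looks as follows. The array names are illustrative: pred_* hold the +1/-1 predictions on P, N and the ambiguous set U, and m_pos_U / m_neg_U are the similarity weights of Section 3.2.

```python
import numpy as np

def gamma(f_xi, m_pos, m_neg):
    """Eq. (19): soft error of an ambiguous example with prediction f_xi in {-1, +1}."""
    return (-0.5 * f_xi + 0.5) * (1.0 - m_neg) + (0.5 * f_xi + 0.5) * (1.0 - m_pos)

def pu_error(pred_P, pred_N, pred_U, m_pos_U, m_neg_U):
    """Eq. (18): classification error over P, N and the ambiguous set U."""
    mis_P = int(np.sum(np.asarray(pred_P) == -1))   # misclassified positives
    mis_N = int(np.sum(np.asarray(pred_N) == +1))   # misclassified reliable negatives
    amb = sum(gamma(f, mp, mn)
              for f, mp, mn in zip(pred_U, m_pos_U, m_neg_U))
    total = len(pred_P) + len(pred_N) + len(pred_U)
    return (mis_P + mis_N + amb) / total
```

Setting f_xi = -1 in gamma gives 1 - m^{-}(x_i) and setting f_xi = +1 gives 1 - m^{+}(x_i), matching the two cases of Eq. (19).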
Algorithm 1 AdaBoost-based transfer learning method for PU learning.
Input: Source dataset S_s = {P_{S_s}, U_{S_s}}, target dataset S_t = {P_{S_t}, U_{S_t}}; the maximum number of classifiers L.
Output: The target classifier F_t.
1: Initialize the weights of the target domain examples, D_1 = (d_{11}, ..., d_{1i}, ..., d_{1|S_t|}), where d_{1i} = 1/|S_t|, i = 1, 2, ..., |S_t|.
2: for l = 1 to L do
3:   Solve for \omega_{0l}, v_{sl}, v_{tl}, b_{1l} and b_{2l} by (14);
4:   Compute the lth classifier f_{tl}(x) = (\omega_{0l} + v_{tl}) \cdot \varphi(x) + b_{2l} for the target task;
5:   Use (18) to calculate the error rate of f_{tl}(x):
        e_l = \frac{\#\{P_{S_t}\}_l + \#\{N_{S_t}\}_l + \big[\sum_{U_{S_t}} \gamma(x_i, y_i)\big]_l}{|S_t|};    (20)
6:   Calculate the weight of the base classifier: \mu_l = \frac{1}{2} \log \frac{1 - e_l}{e_l};
7:   Based on (19), update the weights of the examples, D_{l+1} = (d_{l+1,1}, ..., d_{l+1,i}, ..., d_{l+1,|S_t|}), where
        d_{l+1,i} = \frac{d_{l,i}}{Z_l} \exp(-\mu_l y_i f_{tl}(x_i)), \quad i = 1, 2, ..., |S_t|,    (21)
     and Z_l is the normalization factor
        Z_l = \sum_{i=1}^{|S_t|} d_{l,i} \exp(-\mu_l y_i f_{tl}(x_i)).    (22)
     For x_i \in U_{S_t}, the label is taken to be
        y_i = sign\big(\gamma(x_i, f_{tl}(x_i)) - \tfrac{1}{2}\big);    (23)
8: end for
9: Build the linear combination of the base classifiers:
        f_t(x) = \sum_{l=1}^{L} \mu_l f_{tl}(x).    (24)
10: return the final classifier:
        F_t(x) = sign(f_t(x)) = sign\Big(\sum_{l=1}^{L} \mu_l f_{tl}(x)\Big).    (25)
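The loop structure of Algorithm 1 can be sketched in a few lines. Here fit_base_classifier and pu_error are hypothetical callables standing for steps 3-5 (solving (14) and evaluating (20)), the pseudo-labels are taken from the sign of the score as a simplification of Eq. (23), and the natural logarithm is used where the algorithm writes lg; this is a sketch of the control flow, not the authors' implementation.

```python
import numpy as np

def adatlpu_boost(fit_base_classifier, X_t, pu_error, L=10):
    """Sketch of Algorithm 1: boost L base transfer-PU classifiers.

    fit_base_classifier(X, d) -> callable f giving a real-valued score f(x);
    pu_error(f, d) -> the error rate e_l of Eq. (20).
    """
    n = len(X_t)
    d = np.full(n, 1.0 / n)                       # step 1: uniform weights D_1
    ensemble = []
    for _ in range(L):                            # step 2
        f = fit_base_classifier(X_t, d)           # steps 3-4: build f_tl
        e = pu_error(f, d)                        # step 5: Eq. (20)
        if e <= 0.0 or e >= 0.5:                  # degenerate learner, stop early
            break
        mu = 0.5 * np.log((1.0 - e) / e)          # step 6: classifier weight
        scores = np.array([f(x) for x in X_t])
        y = np.sign(scores)                       # step 7 labels (Eq. (23) on U_St)
        d = d * np.exp(-mu * y * scores)
        d /= d.sum()                              # normalization Z_l, Eq. (22)
        ensemble.append((mu, f))
    # Eqs. (24)-(25): weighted vote of the base classifiers
    return lambda x: int(np.sign(sum(mu * f(x) for mu, f in ensemble)))
```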
In this section, we theoretically analyse the training error of AdaBoost in our proposed model. With the training error analysis, we can prove that the proposed method is convergent. Based on the work [36], we present the analysis of the training error bound as follows.
Table 1
Experiment datasets.
Sub-datasets Configuration Source domain Target domain
Positive Negative Positive Negative
sub-dataset1 comp vs. R_comp comp sci, talk, rec comp sci, talk, rec
sub-dataset2 rec vs. R_rec rec sci, talk, comp rec sci, talk, comp
sub-dataset3 sci vs. R_sci sci rec, talk, comp sci rec, talk, comp
sub-dataset4 Orgs vs. R_Orgs Orgs People, Places Orgs People, Places
sub-dataset5 People vs. R_People People Orgs, Places People Orgs, Places
sub-dataset6 Places vs. R_Places Places Orgs, People Places Orgs, People
sub-dataset7 MNIST vs. USPS C1 in MNIST R1 in MNIST C1 in USPS R1 in USPS
sub-dataset8 USPS vs. MNIST C1 in USPS R1 in USPS C1 in MNIST R1 in MNIST
sub-dataset9 MNIST vs. USPS C2 in MNIST R2 in MNIST C2 in USPS R2 in USPS
sub-dataset10 USPS vs. MNIST C2 in USPS R2 in USPS C2 in MNIST R2 in MNIST
sub-dataset11 MNIST vs. USPS C3 in MNIST R3 in MNIST C3 in USPS R3 in USPS
sub-dataset12 USPS vs. MNIST C3 in USPS R3 in USPS C3 in MNIST R3 in MNIST
sub-dataset13 Amazon vs. DSLR A_bike A_phone D_bike D_phone
sub-dataset14 DSLR vs. Amazon D_bike D_phone A_bike A_phone
sub-dataset15 Webcam vs. Amazon W_bike W_phone A_bike A_phone
sub-dataset16 ImageNet vs. Pascal I_bottle I_chair P_bottle P_chair
sub-dataset17 Pascal vs. ImageNet P_bottle P_chair I_bottle I_chair
sub-dataset18 Caltech-256 vs. Pascal C_bottle C_chair P_bottle P_chair
4. USPS (https://cs.nyu.edu/~roweis/data.html): This is a handwritten digit dataset, which contains 7291 training instances and 2007 test instances. Each image is 16 × 16 grey pixels, similar to the MNIST dataset.
5. Office-31 [42]: This is an object recognition dataset, which contains 31 object categories in three domains: Amazon (A), DSLR (D) and Webcam (W). The Amazon domain contains on average 90 images per class and 2817 images in total. The DSLR domain has 498 images, with 5 objects per category. The Webcam domain has 795 low-resolution images that exhibit significant noise as well as colour and white-balance artefacts.
6. ImageCLEF-DA [43]: This is a benchmark dataset for the ImageCLEF 2014 domain adaptation challenge, which contains three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P). For each domain, there are 12 categories and 50 images in each category.
Next, we introduce the configurations of the datasets. For the 20 newsgroups dataset, following the work in [38], the dataset can be separated into four newsgroups: the ''rec'', ''talk'', ''sci'' and ''comp'' newsgroups. Each newsgroup takes turns as the positive examples, and the remaining three newsgroups are used to generate the negative examples. The negative examples are denoted as R_comp, R_rec and R_sci. For the Reuters-21578 dataset, we divide it into three groups: ''Places'', ''People'' and ''Orgs''. In the same way as the previous operation, each group takes turns as the positive examples and the remaining groups generate the negative examples. We then have the negative examples R_Places, R_People and R_Orgs. For the MNIST dataset, we first scale each picture to 16 × 16 pixels, and then we generate three settings by randomly sampling 2000 instances from two different distributions respectively. We use the digit classes 0, 1 and 2 as positive examples and denote them as C1, C2 and C3 respectively. The notation R1, R2 and R3 means the remaining classes in the dataset apart from the corresponding Ci class, i = 1, 2, 3. In addition, it is noteworthy that we select the training examples randomly and keep the number of examples of each class balanced. For the USPS dataset, we operate in the same way as for the MNIST dataset. For the Office-31 dataset, we create three sub-datasets for the experiment. For each dataset, two classes are chosen, one as positive and the other as negative: we choose the ''bike'' class as the positive class and the ''phone'' class as the negative class. According to the three existing domains, we choose three transfer directions, namely A→D, D→A and W→A. For the ImageCLEF-DA dataset, we operate in the same way as for the Office-31 dataset. We choose the ''bottle'' class as the positive class and the ''chair'' class as the negative class, and we choose I→P, P→I and C→P as the transfer directions. The obtained sub-datasets are shown in Table 1.

For each of the above datasets, we conduct the following operations to obtain sub-datasets for the problem of positive and unlabelled learning. Firstly, we choose one certain category as the positive class and the remaining categories as the negative class. We randomly select 10% of the examples in the positive class as the labelled positive examples, which are referred to as P_{S_s} and P_{S_t} for the source and target domains respectively. The remaining examples in the positive class and the examples from the other categories are used to form the unlabelled sets, denoted as U_{S_s} and U_{S_t}. We use r to denote the ratio of the training examples to the target domain. In this experiment, the r value is initially set to 0.01.
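The sub-dataset construction described above amounts to the following sketch, in which 10% of one category becomes the labelled positive set and everything else is pooled into the unlabelled set; the function and parameter names are illustrative.

```python
import numpy as np

def make_pu_subsets(X, y, pos_label, labelled_frac=0.10, seed=0):
    """Build (P, U) for one domain as in Section 4.2 (illustrative helper)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == pos_label)
    n_labelled = max(1, int(labelled_frac * len(pos_idx)))
    labelled = rng.choice(pos_idx, size=n_labelled, replace=False)
    unlabelled = np.setdiff1d(np.arange(len(y)), labelled)
    return X[labelled], X[unlabelled]   # P: labelled positives, U: everything else
```

Applying this helper once to the source domain and once to the target domain yields the pairs {P_{S_s}, U_{S_s}} and {P_{S_t}, U_{S_t}} used throughout the experiments.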
4.3. Setting of experiment

The parameters of AdaTLPU and the baselines are set as follows. In the GLLC method, the parameters C_n, λ and the RBF kernel parameter σ are chosen from {2^{-7}, 2^{-6}, ..., 2^{7}}, and C_p is set equal to 2C_n. In the PUFC method, the parameter ϵ is chosen from {0, 0.05, 0.1, 0.15, ..., 0.4, 0.45}. In the LMTTL method, λ is chosen from {10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}}, C is chosen from {0.1, 0.5, 1.5, 10, 50, 100}, and a linear kernel function is selected. In the TrAdaB method, SVM is chosen as the base classifier and the number of base classifiers is chosen from {1, 2, 3, ..., 100}. In the MADA method, the parameter λ is fixed to 1. In the MDD method, the asymptotic value of the coefficient η is fixed to 0.1 and γ is chosen from {2, 3, 4}. In our proposed method, the parameters C_{sl} and C_{tl} are chosen from {10^{-3}, 10^{-2}, ..., 10^{3}}, and the regularization parameters λ_{sl} and λ_{tl} are chosen from {10^{-4}, 10^{-3}, ..., 10^{2}, 10^{3}}.

4.4. Experimental results

In this section, we evaluate the AdaTLPU method against the other baselines. In order to avoid sampling error, we use five-fold cross-validation and report the average performance on each dataset. In Table 2, we show the performance and the standard deviation of the compared methods.

In Table 2, the first column denotes the sub-dataset, and the second to the last columns give the performance of each compared method.
Table 2
The performance of the different competitive algorithms in different settings with r = 0.01. The best results are highlighted in bold.
Dataset GLLC PUFC TrAdaB LMTTL MADA MDD AdaTLPU
sub-dataset1 0.678 ± 0.045 0.701 ± 0.037 0.762 ± 0.012 0.718 ± 0.014 0.758 ± 0.130 0.768 ± 0.085 0.771 ± 0.018
sub-dataset2 0.663 ± 0.061 0.657 ± 0.043 0.739 ± 0.010 0.722 ± 0.004 0.701 ± 0.065 0.742 ± 0.034 0.755 ± 0.009
sub-dataset3 0.658 ± 0.057 0.661 ± 0.038 0.751 ± 0.015 0.738 ± 0.025 0.743 ± 0.022 0.769 ± 0.018 0.785 ± 0.013
sub-dataset4 0.695 ± 0.040 0.626 ± 0.032 0.714 ± 0.013 0.637 ± 0.005 0.733 ± 0.025 0.752 ± 0.182 0.760 ± 0.002
sub-dataset5 0.713 ± 0.032 0.644 ± 0.012 0.782 ± 0.015 0.718 ± 0.012 0.752 ± 0.031 0.778 ± 0.019 0.793 ± 0.017
sub-dataset6 0.682 ± 0.014 0.663 ± 0.031 0.321 ± 0.014 0.317 ± 0.008 0.706 ± 0.031 0.726 ± 0.028 0.768 ± 0.016
sub-dataset7 0.631 ± 0.012 0.684 ± 0.009 0.669 ± 0.015 0.641 ± 0.015 0.689 ± 0.078 0.701 ± 0.065 0.704 ± 0.014
sub-dataset8 0.511 ± 0.010 0.523 ± 0.013 0.546 ± 0.014 0.521 ± 0.008 0.571 ± 0.120 0.590 ± 0.051 0.608 ± 0.016
sub-dataset9 0.631 ± 0.021 0.684 ± 0.025 0.659 ± 0.035 0.638 ± 0.025 0.669 ± 0.058 0.689 ± 0.095 0.691 ± 0.011
sub-dataset10 0.523 ± 0.014 0.526 ± 0.023 0.560 ± 0.017 0.532 ± 0.018 0.581 ± 0.087 0.601 ± 0.105 0.593 ± 0.005
sub-dataset11 0.631 ± 0.022 0.684 ± 0.019 0.671 ± 0.009 0.648 ± 0.021 0.681 ± 0.121 0.705 ± 0.059 0.710 ± 0.013
sub-dataset12 0.501 ± 0.004 0.513 ± 0.007 0.552 ± 0.013 0.525 ± 0.019 0.563 ± 0.025 0.585 ± 0.081 0.589 ± 0.006
sub-dataset13 0.638 ± 0.011 0.659 ± 0.020 0.718 ± 0.007 0.702 ± 0.032 0.741 ± 0.125 0.759 ± 0.105 0.768 ± 0.015
sub-dataset14 0.513 ± 0.016 0.514 ± 0.024 0.548 ± 0.025 0.536 ± 0.045 0.560 ± 0.131 0.582 ± 0.200 0.595 ± 0.009
sub-dataset15 0.519 ± 0.003 0.525 ± 0.027 0.539 ± 0.014 0.530 ± 0.057 0.545 ± 0.125 0.559 ± 0.098 0.572 ± 0.011
sub-dataset16 0.631 ± 0.009 0.623 ± 0.017 0.658 ± 0.037 0.650 ± 0.046 0.680 ± 0.085 0.713 ± 0.023 0.717 ± 0.018
sub-dataset17 0.742 ± 0.019 0.759 ± 0.022 0.795 ± 0.028 0.781 ± 0.027 0.831 ± 0.103 0.826 ± 0.035 0.832 ± 0.006
sub-dataset18 0.556 ± 0.028 0.579 ± 0.030 0.619 ± 0.026 0.606 ± 0.014 0.639 ± 0.128 0.647 ± 0.058 0.662 ± 0.009
Table 3
Results of the Wilcoxon signed-rank test (p-values). AdaTLPU differs highly significantly (p < 0.05) from the methods highlighted in bold.
Setting GLLC PUFC TrAdaB LMTTL MADA MDD
p-value 0.0002 0.0002 0.0002 0.0002 0.0002 0.0008
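The test in Table 3 can be reproduced with a standard implementation. As an illustration, the snippet below compares the AdaTLPU and MDD columns of Table 2 on the first six sub-datasets only; a full check would pair all 18 rows.

```python
from scipy.stats import wilcoxon

# AdaTLPU vs. MDD accuracies, sub-datasets 1-6 of Table 2
acc_adatlpu = [0.771, 0.755, 0.785, 0.760, 0.793, 0.768]
acc_mdd = [0.768, 0.742, 0.769, 0.752, 0.778, 0.726]

stat, p = wilcoxon(acc_adatlpu, acc_mdd)
print(f"Wilcoxon signed-rank statistic = {stat}, p-value = {p:.4f}")
```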
Fig. 2. The performance curves on several sub-datasets for the seven methods AdaTLPU, MDD, MADA, LMTTL, TrAdaB, PUFC and GLLC.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix

The Lagrangian function of the problem is

L = \sum_{l=1}^{L} \Big[ \frac{1}{2}\|\omega_{0l}\|^2 + \frac{\lambda_s}{2}\|\nu_{sl}\|^2 + \frac{\lambda_t}{2}\|\nu_{tl}\|^2 + C_s\Big(\sum_{S_{s+}} m_s^{+}(x_i)\xi_{il} + \sum_{S_{s-}} m_s^{-}(x_j)\xi_{jl}\Big) + C_t\Big(\sum_{S_{t+}} m_t^{+}(x_k)\xi_{kl} + \sum_{S_{t-}} m_t^{-}(x_h)\xi_{hl}\Big) \Big]
  - \sum_{S_{s+}} \alpha_{il}^{s+}\big[(\omega_{0l}+v_{sl})^{T} \varphi(x_i) + b_{1l} + \xi_{il} - 1\big]
  - \sum_{S_{s-}} \alpha_{jl}^{s-}\big[-(\omega_{0l}+v_{sl})^{T} \varphi(x_j) - b_{1l} + \xi_{jl} - 1\big]
  - \sum_{S_{t+}} \alpha_{kl}^{t+}\big[(\omega_{0l}+v_{tl})^{T} \varphi(x_k) + b_{2l} + \xi_{kl} - 1\big]
  - \sum_{S_{t-}} \alpha_{hl}^{t-}\big[-(\omega_{0l}+v_{tl})^{T} \varphi(x_h) - b_{2l} + \xi_{hl} - 1\big]
  - \sum_{S_{s+}} \beta_{il}^{s+}\xi_{il} - \sum_{S_{s-}} \beta_{jl}^{s-}\xi_{jl} - \sum_{S_{t+}} \beta_{kl}^{t+}\xi_{kl} - \sum_{S_{t-}} \beta_{hl}^{t-}\xi_{hl}.    (A.1)

Differentiating the Lagrangian function (A.1) with respect to \omega_{0l}, v_{sl}, v_{tl}, b_{1l}, b_{2l}, \xi_{il}, \xi_{jl}, \xi_{kl} and \xi_{hl}, the following equations are obtained:

\nabla_{\omega_{0l}} L = \omega_{0l} - \sum_{S_{s+}} \alpha_{il}^{s+} \varphi(x_i) + \sum_{S_{s-}} \alpha_{jl}^{s-} \varphi(x_j) - \sum_{S_{t+}} \alpha_{kl}^{t+} \varphi(x_k) + \sum_{S_{t-}} \alpha_{hl}^{t-} \varphi(x_h) = 0,    (A.2)

\nabla_{v_{sl}} L = \lambda_s v_{sl} - \sum_{S_{s+}} \alpha_{il}^{s+} \varphi(x_i) + \sum_{S_{s-}} \alpha_{jl}^{s-} \varphi(x_j) = 0,    (A.3)

\nabla_{v_{tl}} L = \lambda_t v_{tl} - \sum_{S_{t+}} \alpha_{kl}^{t+} \varphi(x_k) + \sum_{S_{t-}} \alpha_{hl}^{t-} \varphi(x_h) = 0,    (A.4)

\nabla_{b_{1l}} L = -\sum_{S_{s+}} \alpha_{il}^{s+} + \sum_{S_{s-}} \alpha_{jl}^{s-} = 0,    (A.5)

\nabla_{b_{2l}} L = -\sum_{S_{t+}} \alpha_{kl}^{t+} + \sum_{S_{t-}} \alpha_{hl}^{t-} = 0,    (A.6)

\nabla_{\xi_{il}} L = C_s m_s^{+}(x_i) - (\alpha_{il}^{s+} + \beta_{il}^{s+}) = 0,    (A.7)

\nabla_{\xi_{jl}} L = C_s m_s^{-}(x_j) - (\alpha_{jl}^{s-} + \beta_{jl}^{s-}) = 0,    (A.8)

\nabla_{\xi_{kl}} L = C_t m_t^{+}(x_k) - (\alpha_{kl}^{t+} + \beta_{kl}^{t+}) = 0,    (A.9)

\nabla_{\xi_{hl}} L = C_t m_t^{-}(x_h) - (\alpha_{hl}^{t-} + \beta_{hl}^{t-}) = 0.    (A.10)

Eqs. (A.11)-(A.13) can be obtained from (A.2)-(A.4):

v_{sl} = \frac{1}{\lambda_s}\Big(\sum_{S_{s+}} \alpha_{il}^{s+} \varphi(x_i) - \sum_{S_{s-}} \alpha_{jl}^{s-} \varphi(x_j)\Big),    (A.11)

v_{tl} = \frac{1}{\lambda_t}\Big(\sum_{S_{t+}} \alpha_{kl}^{t+} \varphi(x_k) - \sum_{S_{t-}} \alpha_{hl}^{t-} \varphi(x_h)\Big),    (A.12)

\omega_{0l} = \lambda_s v_{sl} + \lambda_t v_{tl}.    (A.13)

Substituting (A.5)-(A.10) into the Lagrangian function (A.1) yields (A.14):

L(\alpha, \beta) = \frac{1}{2}\|\omega_{0l}\|^2 + \frac{\lambda_s}{2}\|\nu_{sl}\|^2 + \frac{\lambda_t}{2}\|\nu_{tl}\|^2
  - \sum_{S_{s+}} \alpha_{il}^{s+} (\omega_{0l}+v_{sl})^{T} \varphi(x_i) + \sum_{S_{s-}} \alpha_{jl}^{s-} (\omega_{0l}+v_{sl})^{T} \varphi(x_j)
  - \sum_{S_{t+}} \alpha_{kl}^{t+} (\omega_{0l}+v_{tl})^{T} \varphi(x_k) + \sum_{S_{t-}} \alpha_{hl}^{t-} (\omega_{0l}+v_{tl})^{T} \varphi(x_h)
  + \sum_{S_{s+}} \alpha_{il}^{s+} + \sum_{S_{s-}} \alpha_{jl}^{s-} + \sum_{S_{t+}} \alpha_{kl}^{t+} + \sum_{S_{t-}} \alpha_{hl}^{t-}.    (A.14)

The ranges of \alpha_{il}^{s+}, \alpha_{jl}^{s-}, \alpha_{kl}^{t+} and \alpha_{hl}^{t-} can be derived from (A.7)-(A.10):

0 \le \alpha_{il}^{s+} \le C_{sl} m_s^{+}(x_i), \quad 0 \le \alpha_{jl}^{s-} \le C_{sl} m_s^{-}(x_j),
0 \le \alpha_{kl}^{t+} \le C_{tl} m_t^{+}(x_k), \quad 0 \le \alpha_{hl}^{t-} \le C_{tl} m_t^{-}(x_h).

Through the above work, the dual form is proved. □
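The stationarity conditions above are easy to sanity-check numerically. The following sketch draws random dual variables, forms v_{sl}, v_{tl} and \omega_{0l} via (A.11)-(A.13) under the sign convention of (A.2)-(A.4), and verifies that the gradient (A.2) vanishes; the linear feature map and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 5, 8                                  # feature dimension, block size
lam_s, lam_t = 0.7, 1.3                        # arbitrary positive lambdas

a_sp, a_sm = rng.random(n), rng.random(n)      # alpha^{s+}, alpha^{s-}
a_tp, a_tm = rng.random(n), rng.random(n)      # alpha^{t+}, alpha^{t-}
Xsp, Xsm, Xtp, Xtm = (rng.normal(size=(n, dim)) for _ in range(4))

v_s = (a_sp @ Xsp - a_sm @ Xsm) / lam_s        # Eq. (A.11)
v_t = (a_tp @ Xtp - a_tm @ Xtm) / lam_t        # Eq. (A.12)
w_0 = lam_s * v_s + lam_t * v_t                # Eq. (A.13)

# the gradient of Eq. (A.2) must vanish at the stationary point
grad = w_0 - a_sp @ Xsp + a_sm @ Xsm - a_tp @ Xtp + a_tm @ Xtm
assert np.allclose(grad, 0.0)
```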
References

[1] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Hongjun Lu, Philip S. Yu, Text classification without negative examples revisit, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 6–20.
[2] Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li, Partially supervised classification of text documents, in: Claude Sammut, Achim G. Hoffmann (Eds.), Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, Morgan Kaufmann, 2002, pp. 387–394.
[3] Xiaoli Li, Bing Liu, Learning to classify texts using positive and unlabeled data, in: Georg Gottlob, Toby Walsh (Eds.), IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, Morgan Kaufmann, 2003, pp. 587–594.
[4] Xiaoli Li, Philip S. Yu, Bing Liu, See-Kiong Ng, Positive unlabeled learning for data stream classification, in: Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA, SIAM, 2009, pp. 259–270.
[5] Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, Philip S. Yu, Building text classifiers using positive and unlabeled examples, in: Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA, IEEE Computer Society, 2003, pp. 179–188.
[6] Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang, PEBL: Web page classification without negative examples, IEEE Trans. Knowl. Data Eng. 16 (1) (2004) 70–81.
[7] Hong Shi, Shaojun Pan, Jian Yang, Chen Gong, Positive and unlabeled learning via loss decomposition and centroid estimation, in: Jérôme Lang (Ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp. 2689–2695.
[8] Chuang Zhang, Dexin Ren, Tongliang Liu, Jian Yang, Chen Gong, Positive and unlabeled learning with label disambiguation, in: Sarit Kraus (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 4250–4256.
[9] Charles Elkan, Keith Noto, Learning classifiers from only positive and unlabeled data, in: Ying Li, Bing Liu, Sunita Sarawagi (Eds.), Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, ACM, 2008, pp. 213–220.
[10] Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama, Analysis of learning from positive and unlabeled data, in: Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, Kilian Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 703–711.
[11] Noah Youngs, Dennis E. Shasha, Richard Bonneau, Positive-unlabeled learning in the face of labeling bias, in: IEEE International Conference on Data Mining Workshop, ICDMW 2015, Atlantic City, NJ, USA, November 14-17, 2015, IEEE Computer Society, 2015, pp. 639–645.
[12] Shuangxun Ma, Ruisheng Zhang, PU-LP: A novel approach for positive and unlabeled learning by label propagation, in: 2017 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops, Hong Kong, China, July 10-14, 2017, IEEE Computer Society, 2017, pp. 537–542.
[13] Jing Gao, Wei Fan, Jing Jiang, Jiawei Han, Knowledge transfer via multiple model local structure mapping, in: Ying Li, Bing Liu, Sunita Sarawagi (Eds.), Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, ACM, 2008, pp. 283–291.
[14] Hiba Chougrad, Hamid Zouaki, Omar Alheyane, Multi-label transfer learning for the early diagnosis of breast cancer, Neurocomputing 392 (2020) 168–180.
[15] Xin Zheng, Luyue Lin, Bo Liu, Yanshan Xiao, Xiaoming Xiong, A multi-task transfer learning method with dictionary learning, Knowl. Based Syst. 191 (2020) 105233.
[16] Sotiris B. Kotsiantis, Bagging and boosting variants for handling classifications problems: a survey, Knowl. Eng. Rev. 29 (1) (2014) 78–100.
[17] Maxime Latulippe, Alexandre Drouin, Philippe Giguère, François Laviolette, Accelerated robust point cloud registration in natural environments through positive and unlabeled learning, in: Francesca Rossi (Ed.), IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, IJCAI/AAAI, 2013, pp. 2480–2487.
[18] Slim Kanoun, Adel M. Alimi, Yves Lecourtier, Natural language morphology integration in off-line arabic optical text recognition, IEEE Trans. Syst. Man Cybern. B 41 (2) (2011) 579–590.
[19] Wee Sun Lee, Bing Liu, Learning with positive and unlabeled examples using weighted logistic regression, in: Tom Fawcett, Nina Mishra (Eds.), Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, AAAI Press, 2003, pp. 448–455.
[20] Ting Ke, Ling Jing, Hui Lv, Lidong Zhang, Yaping Hu, Global and local learning from positive and unlabeled examples, Appl. Intell. 48 (8) (2018) 2373–2392.
[21] Sundararajan Sellamanickam, Priyanka Garg, Sathiya Keerthi Selvaraj, A pairwise ranking based approach to learning with positive and unlabeled examples, in: Craig Macdonald, Iadh Ounis, Ian Ruthven (Eds.), Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, ACM, 2011, pp. 663–672.
[22] Marthinus Christoffel du Plessis, Gang Niu, Masashi Sugiyama, Convex formulation for learning from positive and unlabeled data, in: Francis R. Bach, David M. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, in: JMLR Workshop and Conference Proceedings, vol. 37, JMLR.org, 2015, pp. 1386–1394.
[23] Sinno Jialin Pan, Qiang Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[24] Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S. Yu, Ruixin Zhu, Transfer learning on heterogenous feature spaces via spectral transformation, in: Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, Xindong Wu (Eds.), ICDM 2010, the 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, IEEE Computer Society, 2010, pp. 1049–1054.
[25] Hua-Yan Wang, Qiang Yang, Transfer learning by structural analogy, in: Wolfram Burgard, Dan Roth (Eds.), Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, AAAI Press, 2011.
[26] Jing Jiang, ChengXiang Zhai, Instance weighting for domain adaptation in NLP, in: John A. Carroll, Antal van den Bosch, Annie Zaenen (Eds.), ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic, The Association for Computational Linguistics, 2007.
[27] Xuejun Liao, Ya Xue, Lawrence Carin, Logistic regression with an auxiliary data source, in: Luc De Raedt, Stefan Wrobel (Eds.), Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, in: ACM International Conference Proceeding Series, vol. 119, ACM, 2005, pp. 505–512.
[28] Pengcheng Wu, Thomas G. Dietterich, Improving SVM accuracy by training on auxiliary data sources, in: Carla E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.
[29] Tony Jebara, Multi-task feature and kernel selection for SVMs, in: Carla E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.
[30] Su-In Lee, Vassil Chatalbashev, David Vickrey, Daphne Koller, Learning a meta-level prior for feature relevance from multiple related tasks, in: Zoubin Ghahramani (Ed.), Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, in: ACM International Conference Proceeding Series, vol. 227, ACM, 2007, pp. 489–496.
[31] Theodoros Evgeniou, Massimiliano Pontil, Regularized multi-task learning, in: Won Kim, Ron Kohavi, Johannes Gehrke, William DuMouchel (Eds.), Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, ACM, 2004, pp. 109–117.
[32] Lilyana Mihalkova, Tuyen N. Huynh, Raymond J. Mooney, Mapping and revising Markov logic networks for transfer learning, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, AAAI Press, 2007, pp. 608–614.
[33] Matthew Richardson, Pedro M. Domingos, Markov logic networks, Mach. Learn. 62 (1–2) (2006) 107–136.
[34] Chris Buckley, Gerard Salton, James Allan, The effect of adding relevance information in a relevance feedback environment, in: W. Bruce Croft, C.J. van Rijsbergen (Eds.), Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 3-6 July 1994 (Special Issue of the SIGIR Forum), ACM/Springer, 1994, pp. 292–300.
[35] Vladimir Vapnik, The Nature of Statistical Learning Theory, Springer, 2013.
[36] Yoav Freund, Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. System Sci. 55 (1) (1997) 119–139.
[37] TingTing Li, WeiYa Fan, YunSong Luo, A method on selecting reliable samples based on fuzziness in positive and unlabeled learning, 2019, CoRR, abs/1903.11064.
[38] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, Yong Yu, Boosting for transfer learning, in: Zoubin Ghahramani (Ed.), Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, in: ACM International Conference Proceeding Series, vol. 227, ACM, 2007, pp. 193–200.
[39] Brian Quanz, Jun Huan, Large margin transductive transfer learning, in: David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, Jimmy J. Lin (Eds.), Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, ACM, 2009, pp. 1327–1336.
[40] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, Jianmin Wang, Multi-adversarial domain adaptation, in: Sheila A. McIlraith, Kilian Q. Weinberger (Eds.), Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 3934–3941.
[41] Yuchen Zhang, Tianle Liu, Mingsheng Long, Michael I. Jordan, Bridging theory and algorithm for domain adaptation, in: Kamalika Chaudhuri, Ruslan Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, in: Proceedings of Machine Learning Research, vol. 97, PMLR, 2019, pp. 7404–7413.
[42] Piotr Koniusz, Yusuf Tas, Fatih Porikli, Domain adaptation by mixture of alignments of second- or higher-order scatter tensors, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 7139–7148.
[43] Yabin Zhang, Hui Tang, Kui Jia, Mingkui Tan, Domain-symmetric networks for adversarial domain adaptation, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 5031–5040.
[44] Ben Derrick, Paul White, Comparing two samples from an individual Likert question, Int. J. Math. Stat. 18 (2017).
[45] John W. Pratt, Remarks on zeros and ties in the Wilcoxon signed rank procedures, J. Amer. Statist. Assoc. 54 (287) (1959) 655–667.

Bo Liu is a professor with the Faculty of Automation, Guangdong University of Technology. His research interests include support vector machines, feature extraction and clustering. He has published papers in IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM) and the ACM International Conference on Information and Knowledge Management (CIKM).

Changdong Liu is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include positive and unlabelled learning and boosting methods.

Yanshan Xiao received the Ph.D. degree in computer science from the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia, in 2011. She is with the Faculty of Computer, Guangdong University of Technology. Her research interests include data mining and machine learning. She has published papers in IEEE PAMI, IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Knowledge and Information Systems, and the International Joint Conferences on Artificial Intelligence (IJCAI).

Laiwang Liu is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.

Weibin Li is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.

Xiaodong Chen is pursuing his Master's degree at the Department of Automation, Guangdong University of Technology. His research interests include data mining and machine learning.