Knowledge-Based Systems 207 (2020) 106399


Several alternative term weighting methods for text representation and classification

Zhong Tang, Wenqiang Li ∗, Yan Li, Wu Zhao, Song Li
School of Mechanical Engineering, Sichuan University, Chengdu 610065, China
Innovation Method and Creative Design Key Laboratory of Sichuan Province, Chengdu 610065, China

Article history:
Received 29 April 2020
Received in revised form 5 August 2020
Accepted 8 August 2020
Available online 14 August 2020

Keywords:
Unsupervised term weighting
Supervised term weighting
Text representation
Text classification
Nonlinear transformation

Abstract

Text representation is a hot topic that underpins text classification (TC) tasks and has a substantial impact on TC performance. Although the famous TF–IDF was designed for information retrieval rather than for TC tasks, it is widely used in TC as a term weighting method to represent text contents. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, we propose several alternative methods in this study to generate unsupervised term weighting schemes that offset the drawback confronting TF–IDF. Moreover, because TC tasks differ from information retrieval, representing test texts as vectors in an appropriate way is also essential, especially for supervised term weighting approaches (e.g., TF–RF), since these methods use category information when weighting terms. Yet most current schemes do not clearly explain how test texts should be represented. To explore this problem and seek a reasonable solution, we analyze a classic unsupervised term weighting method and three typical supervised term weighting methods in depth to illustrate how to represent test texts. To investigate the effectiveness of our work, three sets of experiments are designed to compare their performance. Comparisons show that our proposed methods can indeed enhance the performance of TC, and sometimes even outperform existing supervised term weighting methods.

© 2020 Elsevier B.V. All rights reserved.

∗ Correspondence to: School of Mechanical Engineering, Sichuan University, No. 24 South Section 1, Yihuan Road, Chengdu 610065, China.
E-mail addresses: tangzscu@163.com (Z. Tang), liwenqiang@scu.edu.cn (W. Li), liyan@scu.edu.cn (Y. Li), zhaowu@scu.edu.cn (W. Zhao), sculisong1992@163.com (S. Li).

https://doi.org/10.1016/j.knosys.2020.106399

1. Introduction

Automatic text classification (TC) technology can efficiently organize and categorize text that is growing dramatically [1]; it thus eliminates a large amount of human effort [2] and has attracted wide attention in recent years [3,4]. The goal of the TC task is to categorize unlabeled texts into predefined classes based on their topics [5]; hence, from a set of prelabeled texts, an automatic text classifier can be established in the learning process [6,7]. Before classifiers are applied, every term in a text first needs to be assigned a numerical value (weight) by an appropriate term weighting scheme, a step called text representation [8,9]. The vector space model (VSM) is the most popular way to represent texts [10,11]; it treats a text as a set of terms, namely the bag-of-words (BoW) model [12,13]. VSM constructs the text collection as a document-term matrix, in which each entry represents the importance of a certain term tj in a certain document dk [14,15]; each row denotes one of the document vectors, whereas each column corresponds to one of the distinct terms (i.e., selected features).

Term weighting is critical to the TC task and has a direct and significant effect on text classification performance [16]. At present, term weighting approaches are generally grouped into unsupervised and supervised, according to whether they embrace the class information of the training texts [17,18]. Unsupervised term weighting (UTW) methods neglect class information, whereas supervised term weighting (STW) methods exploit the category information when calculating the weight. Among UTW schemes, term frequency (TF) and TF–IDF (term frequency–inverse document frequency) are commonly used. TF is one of the simplest weighting methods, but it is a local weighting approach because it only considers how many times a term occurs in a text. To conquer this drawback, the inverse document frequency (IDF) was designed to generate the TF–IDF scheme; it is concerned with how many texts a term has appeared in. Note that TF–IDF was primarily designed for information retrieval (IR) rather than for TC tasks [10,19].

Different from the IR task, the TC task aims to discriminate between different classes, not texts [20], and thus it should take category factors into account when computing the weight of terms. For that reason, it can be said that TC is a supervised learning task [9,17,21,22]. Most recent STW methods originate from feature selection schemes; these methods adopt the category information in several ways, which can be summarized as follows. First of all, TF-CHI2, TF-IG and TF-GR were proposed based on feature selection approaches (i.e., the Chi-square statistic (CHI2), information gain (IG) and gain ratio (GR)) [18]. Since then, various STW methods similar to the above schemes have also been presented, for example, the odds ratio (OR) weighting factor in TF-OR [14,23,24], the mutual information (MI) weighting factor in TF-MI [14,23], the probability-based (PB) weighting factor in TF-PB [24], and the correlation coefficient weighting factor in TF-CC [24].

Apart from these schemes, a variety of STW schemes derived from TF–IDF have been built and proposed. Initially, inspired by the IDF in the TF–IDF scheme, the inverse class frequency (ICF) was introduced; it reflects the idea that a key term of a specific class usually appears in only a few categories [25]. However, because the number of categories is generally quite small, a certain term may occasionally exist in multiple categories or sometimes even in all categories [25,26]. As a result, ICF fails to reflect the degree of importance of a term under certain circumstances. To enhance a term's distinguishing power, the ICF was incorporated to generate the TF–IDF–ICF scheme [14,25]. Meanwhile, in [14], the authors pointed out that the TF–IDF–ICF scheme emphasizes rare terms. To alleviate this problem, they redesigned the ICF and proposed a novel scheme called TF–IDF–ICSDF. Besides these, Lan et al. [17] argued that the distinguishing ability of a term depends only on the relevant texts that contain this term, and hence they presented a novel STW scheme by replacing the IDF in TF–IDF with the RF (relevance frequency). More importantly, results in many studies demonstrate that TF–RF achieves better performance than most STW and UTW approaches [9,17,23]. Moreover, in [22], the authors claimed that TF–IDF is not necessarily suitable for the TC task, and hence two novel term weighting methods called TF-IGM and RTF-IGM were presented. Further, to deal with the drawback of TF-IGM, two improved weighting methods based on IGM have also been developed [27].

We must note, however, that not all STW schemes are always superior to UTW schemes [17,20,26,28]. Moreover, as emphasized in [22], most STW schemes show their own superiority in some specific TC tasks, but they cannot consistently yield the best classification performance, while they require more storage space and running time [14]. Because of this, the motivation of this paper is to focus on UTW methods. We draw inspiration from the IDF part of TF–IDF, which is defined as a logarithmic transformation and for which many works have sought to build a theoretical basis [29]. Hence, we naturally ask whether there are other nonlinear transformation forms, like the logarithmic transformation, that can achieve better classification performance than the existing ones. This is the first question we wish to address in this research. Besides that, because test texts do not have any prior category label information [30,31], how to properly represent test texts is a difficult but significant task [30], especially for STW schemes, since these methods need to use category information in the weighting process; examples include TF–RF [17], TF–IDF–ICSDF [14], TF-IGM [22], RTF-IGMimp [27] and so on. Nevertheless, most existing schemes, including TF–IDF, do not clearly explain how to represent test texts [31]. Naturally, we raise the second question: how should test texts be represented for the TC task? Besides these two questions, a third question will be raised in Section 3, and its answer will be presented at the end of this article. Our contributions in this work can be summarized as follows. Firstly, several nonlinear transformation methods are introduced and compared with the famous TF–IDF, which adopts the logarithmic ratio. Secondly, we clearly explain how to represent test texts in the weighting process through a classic UTW method and three typical STW methods with different characteristics. Thirdly, a nonlinear transformation method is designed and used for TF, which performs better than the square root function-based TF. Finally, we conduct an extensive experimental comparison of our schemes with existing UTW and STW methods. Experimental results demonstrate the effectiveness and superiority of our proposed schemes.

The rest of this manuscript is outlined as follows. Section 2 takes several existing term weighting methods as examples to clearly show how to represent test texts. In Section 3, we propose four UTW methods for TC. The experimental settings are explained in Section 4. In Section 5, our findings are presented and discussed. We conclude our work in Section 6.

2. Analysis of current term weighting schemes

There is no doubt that term weighting is essential for the TC task [32]; it measures the importance of a term (feature) in representing the content of a text [14,15]. Next, we analyze four term weighting approaches (unsupervised and supervised) in depth because they were recently proposed or are closely related to ours, namely TF–IDF, TF–RF, TF–IDF–ICSDF, and RTF-IGMimp. Importantly, through these methods we clearly explain how to represent test texts.

2.1. Unsupervised term weighting schemes

To facilitate the definition of different term weighting methods, the notations used in this research are first presented in Table 1. With these notations, the mathematical expressions of the TF–IDF, TF–RF, TF–IDF–ICSDF, and RTF-IGMimp schemes are defined in Table 2, where the fourth column indicates the corresponding test text representation scheme. Furthermore, unless otherwise specified, the base of the logarithm is e (the natural base) in this study.

TF–IDF, as a classic unsupervised term weighting method, has become the default choice for calculating the weight of terms [24]. It should be pointed out that a term weight mainly consists of two factors [21,28]. Taking the TF–IDF method as an example, the first component is the local factor within a text (i.e., TF), and the second component is the global factor over the dataset (i.e., IDF). Besides, it may be noted that IDF has one disadvantage: if the term tj appears in all texts, namely df(tj) = N, then the IDF(tj) score will be log(N/N) = 0. To address this issue, an improved term weighting method, namely term frequency–inverse exponential frequency (TF–IEF), was proposed in our previous study [28]; its mathematical formula is shown in Table 2.

Fig. 1(a) presents the steps of the TF–IDF scheme in text representation. This process consists of four steps: preprocessing, feature selection, generating the document-term matrix based on TF, and calculating the weight of a given term. It is worth noting that the major difference between training text and test text representation is the local weighting factor, while they use the same global weighting factors. That is, to represent test texts for the TC task, the test text also uses training set information in the weighting process. For example, when the global factor log(N/df(tj)) is used to calculate the weight of a test text, N represents the total number of texts in the training set, rather than the total number of texts in the test set (i.e., M). The pseudo code of the TF–IDF scheme is shown in Algorithm 1, which clearly shows how to use a UTW method to represent training and test texts.
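Since Algorithm 1 itself is not reproduced here, the following minimal Python sketch (our own illustration, not the authors' pseudo code) captures the idea: N and df(tj) are computed once from the training set, and the same global factor log(N/df(tj)) is reused when weighting test documents; only the local TF part differs between training and test texts.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute N and df(t_j) from the training set only (tokenised documents)."""
    N = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))                      # each term counted once per document
    return {t: math.log(N / df[t]) for t in df}  # IDF(t_j) = log(N / df(t_j))

def tfidf_vector(doc, idf):
    """Weight one document (training or test) with TF x IDF; unseen terms are ignored."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

# toy usage: test documents reuse the training-set IDF, never their own statistics
train = [["price", "rise"], ["price", "fall"], ["match", "win"]]
test = [["price", "win", "unknown"]]
idf = fit_idf(train)
print(tfidf_vector(train[0], idf))
print(tfidf_vector(test[0], idf))   # 'unknown' never occurs in training, so it is dropped
```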

Table 1
Descriptions of notations used to represent term weighting methods.
Notation Description
Aij Number of documents that contain feature tj in the positive category.
Bij Number of documents that do not contain feature tj in the positive category.
Cij Number of documents that contain feature tj in the negative category.
Dij Number of documents that do not contain feature tj in the negative category.
TP The count of test documents that belong to category ci and are correctly classified into category ci .
FP The count of test documents that do not belong to category ci but are incorrectly classified into category ci .
FN The count of test documents that belong to category ci but are not classified into category ci .
TN The count of test documents that do not belong to class ci and are not classified into class ci .
N Total number of documents in the training set, N = Aij + Bij + Cij + Dij .
M Total number of documents in the test set.
tftrain (tj , dk ) The raw term frequency of feature tj in training document dk (k = 1, 2, . . . , N).
tftest (tj , dl ) The raw term frequency of feature tj in test document dl (l = 1, 2, . . . , M).
df (tj ) Number of documents containing tj in the training set, and df (tj ) = Aij + Cij .
m Total number of categories.
n The number of chosen features.
λ λ is an adjustable coefficient and its default value is set to 7.0, and λ∈[5.0, 9.0].
fji Frequencies of tj's occurrences in different categories of the training set, sorted in descending order with i (i = 1, 2, . . . , m) being the rank.
nci (tj ) Number of documents containing the term tj in a certain category ci of the training set.
Nci Total number of documents in a certain category ci of the training set.
Dtotal (tj_ max ) Total number of documents in a certain category ci of the training set in which tj appears most.
rtf , ripf The distortion parameters of TF and IPF respectively; their default values are set to 0.5, and rtf , ripf ∈ (0, 1.0].

Table 2
Term weighting schemes to be compared.

Unsupervised term weighting:
  TF–IDF
    Training document: tftrain(tj, dk) × log(N / df(tj))
    Test document:     tftest(tj, dl) × log(N / df(tj))
  TF–IEF
    Training document: tftrain(tj, dk) × e^(−df(tj)/N)
    Test document:     tftest(tj, dl) × e^(−df(tj)/N)

Supervised term weighting:
  TF–RF, with RF(tj, ci) = log2(2 + Aij / max(1, Cij))
    Training document: tftrain(tj, dk) × log2(2 + Aij / max(1, Cij))
    Test document:     tftest(tj, dl) × log(N / df(tj))
  TF–IDF–ICSDF, with ICSDF(tj) = log(m / Σ_{i=1..m} nci(tj)/Nci)
    Training document: tftrain(tj, dk) × (1 + log(N / df(tj))) × (1 + ICSDF(tj))
    Test document:     tftest(tj, dl) × (1 + log(N / df(tj))) × (1 + ICSDF(tj)), or tftest(tj, dl) × log(N / df(tj))
  RTF-IGMimp, with IGMimp(tj) = fj1 / (Σ_{i=1..m} fji · i + log10(Dtotal(tj_max) / fj1))
    Training document: sqrt(tftrain(tj, dk)) × (1 + λ × IGMimp(tj))
    Test document:     sqrt(tftest(tj, dl)) × (1 + λ × IGMimp(tj)), or tftest(tj, dl) × log(N / df(tj))

Proposed term weighting:
  TF–IHF
    Training document: tftrain(tj, dk) × (N/df(tj)) / (1 + N/df(tj))
    Test document:     tftest(tj, dl) × (N/df(tj)) / (1 + N/df(tj))
  TF–ISF
    Training document: tftrain(tj, dk) × 1 / (1 + e^(−N/df(tj)))
    Test document:     tftest(tj, dl) × 1 / (1 + e^(−N/df(tj)))
  TF–ITF
    Training document: tftrain(tj, dk) × (2 / (1 + e^(−2N/df(tj))) − 1)
    Test document:     tftest(tj, dl) × (2 / (1 + e^(−2N/df(tj))) − 1)
  DTF–IPF
    Training document: (tftrain(tj, dk))^rtf × (N/df(tj))^ripf
    Test document:     (tftest(tj, dl))^rtf × (N/df(tj))^ripf

2.2. Supervised term weighting schemes

The following related works are all recently proposed STW methods with different characteristics. These schemes are selected in this study because of their reported superior classification performance.

2.2.1. TF–RF scheme

Contrary to the TF–IDF scheme, the TF–RF (term frequency–relevance frequency) scheme takes category information into consideration in the weighting process [17]. As presented in [17], the authors argued that the weight of a certain term relies only on the frequency of relevant texts, that is, Aij/max(1, Cij). In the TF–RF scheme, the base of the logarithm is 2. Furthermore, we must emphasize that the main influence on RF(tj, ci) may be the constant value 2 rather than Aij/max(1, Cij) whenever the latter is less than 2. For instance, statistics show that Aij and Cij of the term 'costs' are 150 and 255 respectively in category 'earn' of the Reuters-21578 dataset [31], thus Aij/max(1, Cij) = 0.5882. Moreover, as emphasized in [29], the base of the logarithm is not important. Hence, based on this fact, it can be said that the larger the base of the logarithm, the larger the effect of the constant value.

Besides that, from the principle of the TF–RF scheme we can see that it divides texts into positive and negative categories when weighting a term. More precisely, the current class (target class) is tagged as the positive class, while all other classes are defined as the negative class. However, for the TC task, the test texts do not have any prior category label information [30,31]. Hence, how to represent test texts in an appropriate way is a basic and significant task [30]. Fig. 1(b) presents the steps of the TF–RF scheme in text representation. Similar to the TF–IDF method, the TF–RF method is also composed of four steps. For simplicity, Fig. 1(b) shows only the last two steps. It is worth mentioning that the major difference between the TF–IDF and TF–RF schemes is the global weighting factor. Moreover, TF–RF is obtained based on positive and negative classes, and hence the test text can only be computed based on the IDF of the training set. That is, TF–RF is used for the training texts and TF–IDF for the test texts. The pseudo code of the TF–RF scheme is shown in Algorithm 2, which clearly presents how to use an STW method to represent training and test texts. Moreover, it is important to note that it consumes more computation time than UTW methods such as TF–IDF.

Besides the TF–RF scheme, many similar STW methods are also based on the positive and negative classes [17,26], such as TF-CHI2, TF-IG, and TF-OR. Hence, their text representation is similar to that of TF–RF.

Fig. 1. Term weighting methods. (a) TF–IDF scheme. (b) TF–RF scheme. (c) TF–IDF–ICSDF scheme.

2.2.2. TF–IDF–ICSDF scheme

In the TF–IDF–ICF scheme, the ICF only concerns whether a certain term belongs to a class or not, and hence it is not a good discriminator if a term appears in multiple classes [14]. To deal with this problem, a revised version of TF–IDF–ICF was presented, namely the TF–IDF–ICSDF scheme, whose mathematical expression is shown in Table 2. In the TF–IDF–ICSDF scheme, an unreasonable characteristic can be observed: the ICSDF factor degenerates into the IDF factor when the numbers of texts in the individual classes are the same (i.e., balanced datasets). Now, let us discuss two simple examples.

Fig. 2(a) shows the balanced case: the term t1 has document frequencies {4, 2} in the two categories, and each category has 6 texts. Then ICSDF(t1) = IDF(t1) = 0.6931. Fig. 2(b) shows the unbalanced case: the term t2 has document frequencies {4, 2} in the two categories, whereas categories c1 and c2 contain 6 and 4 texts respectively. Then ICSDF(t2) = 0.5390 and IDF(t2) = 0.5108, and thus ICSDF(t2) ≠ IDF(t2) in the unbalanced case. Therefore, it can be concluded that there is no difference between ICSDF and IDF on balanced datasets.

Fig. 2. Schematic diagram of two simple examples.
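The two toy cases of Fig. 2 can be checked directly. The short sketch below (our own, using the ICSDF and IDF forms as reconstructed in Table 2 with natural logarithms) reproduces the values quoted above.

```python
import math

def idf(N, df):
    return math.log(N / df)

def icsdf(class_doc_freqs, class_sizes):
    """ICSDF(t) = log(m / sum_i n_ci(t)/N_ci), per the reconstructed Table 2 expression."""
    m = len(class_sizes)
    return math.log(m / sum(n / Nc for n, Nc in zip(class_doc_freqs, class_sizes)))

# Fig. 2(a): balanced case, document frequencies {4, 2}, both classes have 6 documents
print(round(icsdf([4, 2], [6, 6]), 4), round(idf(12, 6), 4))   # 0.6931 0.6931

# Fig. 2(b): unbalanced case, document frequencies {4, 2}, class sizes 6 and 4
print(round(icsdf([4, 2], [6, 4]), 4), round(idf(10, 6), 4))   # 0.539 0.5108
```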
Fig. 1(c) presents the steps of the TF–IDF–ICSDF scheme in text representation. Similar to the TF–RF method, Fig. 1(c) also shows only the last two steps. It is worth noting that although class information is considered in the TF–IDF–ICSDF method, the weight of terms is not calculated according to positive and negative classes; by contrast, the ICSDF is calculated globally over each category without class discrimination (as shown in Fig. 2). Hence, there are two ways to compute the weight of a test text, namely TF–IDF–ICSDF and TF–IDF; this is the biggest difference from the TF–RF method. One of the purposes of our study is to explore how to represent the test text in an appropriate way, so we will compare the two approaches in our experiments. Certainly, just like the IDF in the TF–IDF scheme, the test text also uses training set information when the weight is calculated using TF–IDF–ICSDF. As a typical STW method, Algorithm 3 clearly shows the two ways in which the TF–IDF–ICSDF scheme represents test texts; there is no doubt that the training text is represented by TF–IDF–ICSDF. Furthermore, the pseudo code of the TF–IDF–ICF method is very similar to that of TF–IDF–ICSDF.

2.2.3. RTF-IGMimp scheme

As described in [22], the authors claimed that the inter-class distribution of a certain term should not be neglected. For this purpose, the TF-IGM scheme was proposed to capture this basic idea, based on a new statistical measure named the inverse gravity moment (IGM). As mentioned in the source literature [22], TF-IGM suffers from a major drawback: terms may receive the same weights in some special cases. To resolve this problem, in [27] the authors added a ratio to the standard TF-IGM formula to generate two improved methods, TF-IGMimp and RTF-IGMimp. Their experimental results showed that RTF-IGMimp was more successful than TF-IGMimp. The reason may be that the square root of TF (named RTF) is adopted in the RTF-IGMimp scheme; its mathematical formula is shown in Table 2.

It is worth mentioning that the detailed steps of the RTF-IGMimp scheme in text representation and its pseudo code are similar to those of the TF–IDF–ICSDF method. In addition, since the IGMimp factor is computed globally over all classes, just like ICSDF, there are also two ways to calculate the weight of the test text, namely RTF-IGMimp and TF–IDF. We will also compare these two approaches in our experiments.

Based on the above analysis of UTW and STW methods, we can conclude that no matter which scheme is used to represent the test texts, its global weighting factor (e.g., IDF, ICSDF, and IGMimp) is the same as for the training texts. Hence, the local weighting factor (i.e., TF) becomes the only difference between training text and test text representation. Moreover, for the two STW methods TF–IDF–ICSDF and RTF-IGMimp, the test text can be represented by the schemes themselves (shown in Table 2). Besides that, the TF–IDF method can also be used to represent the test text, which is similar to the test text representation in the TF–RF method (shown in Table 2).
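To make the two options concrete, the sketch below (our own illustration) contrasts them using ICSDF as the class-global factor, fitted once on training statistics; a test document is then weighted either with that factor or with the plain IDF fallback. It only contrasts the global factors: the full TF–IDF–ICSDF weight additionally includes the (1 + log(N/df(tj))) part shown in Table 2, and all numbers below are made up for illustration.

```python
import math
from collections import Counter

def fit_global_factors(class_doc_freqs, class_sizes, N, df):
    """Per-term ICSDF and IDF, both computed from training-set statistics only."""
    m = len(class_sizes)
    icsdf = {t: math.log(m / sum(n / Nc for n, Nc in zip(freqs, class_sizes)))
             for t, freqs in class_doc_freqs.items()}
    idf = {t: math.log(N / df[t]) for t in df}
    return icsdf, idf

def represent_test_doc(doc, factor):
    """tf_test(t_j, d_l) x chosen global factor; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: tf[t] * factor[t] for t in tf if t in factor}

# hypothetical training statistics: per-class document frequencies, class sizes, N, df
class_doc_freqs = {"price": [4, 2], "match": [1, 5]}
class_sizes = [6, 4]
N, df = 10, {"price": 6, "match": 6}

icsdf, idf = fit_global_factors(class_doc_freqs, class_sizes, N, df)
test_doc = ["price", "price", "match"]
print(represent_test_doc(test_doc, icsdf))  # option 1: the STW scheme's own class-global factor
print(represent_test_doc(test_doc, idf))    # option 2: the TF-IDF fallback
```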

3. Proposed unsupervised term weighting schemes

From the above arguments, it is clear that choosing an appropriate metric function for weighting terms is the key to obtaining high-quality TC performance [7]. Although the TF–IDF weighting scheme is borrowed from the IR field and ignores the available category information of the training texts [22], we are convinced that the TF–IDF scheme is reasonable in TC tasks; the rationale can be interpreted through three fundamental assumptions [18], i.e., the TF assumption, the IDF assumption, and the normalization assumption. Hence, based on these hypotheses, the UTW schemes we propose should also integrate them. Inspired by the IDF part of TF–IDF, which is defined as a logarithmic transformation, several alternative UTW methods are proposed in this section, and these methods can overcome the drawback confronting TF–IDF. Furthermore, we adopt the most classic TF–IDF scheme as a baseline for comparison in this research.

Based on the previous descriptions, a term weighting scheme (unsupervised or supervised) can roughly be regarded as the product of TF and a certain global function (e.g., IDF, RF), and thus TF fits naturally into the weighting process. Moreover, since a weighting scheme is composed of two components (local factors and global factors), it is apparent that a weighting method can be improved by changing one of the components or both. So, first of all, a natural way is to multiply TF by a global function like IDF.

The IDF part of TF–IDF uses a logarithmic function (denoted as fl(x)) to transform the document frequency. Further analysis indicates that this function has three main properties: (i) the logarithmic function is unbounded, (ii) the first order derivative fl′(x) > 0, and (iii) the second order derivative fl″(x) < 0. The alternative schemes we propose may also be unbounded, but they should at least satisfy the other two conditions. To the best of our knowledge, several nonlinear transformation methods that satisfy the above properties can be constructed; they are summarized in Table 3. Note that these schemes can all be linearized into the form Y = A + BX. So far, the mathematical forms of these schemes have been determined; the remaining point is to estimate the parameters a, b, and c in these functions. For simplicity, the values of a, b, and c are set as shown in Table 4.

Among these functions, previous studies have shown that the inverse exponential function (denoted as fe(x)) can achieve satisfactory performance compared with the logarithmic function (i.e., fl(x)) [28]. In this work, besides these two nonlinear transformation functions, we propose four more: the hyperbolic function, the sigmoid function, the tanh function and the proportional distortion function, denoted as fh(x), fs(x), ft(x) and fp(x), respectively. Fig. 3(a) shows the nonlinear mapping of these functions in the interval [0, 6]. For comparison, fl(x) and fe(x) are also plotted, and the parameter r of the proportional distortion function is set to 0.3. Similar to the logarithmic function, these functions are slowly increasing. It is also necessary to note that the sigmoid function is obtained when a = b = c = 1 in the S-shape function, and thus the sigmoid function is a special case of the S-shape function. In particular, the tanh function can be regarded as a combination of two sigmoid functions. As a matter of fact, the sigmoid and tanh functions are often selected as activation functions in neural networks. Note that when r ∈ (0,1], the power function becomes a proportional distortion function [33].

For the TC task, these functions should be used as the global weighting factor to fulfill our goal; hence we treat x = N/df(tj) as the independent variable to generate the alternative schemes, as shown in Table 2. Similarly to TF–IEF, these alternative schemes are named IHF (inverse hyperbolic frequency), ISF (inverse sigmoid frequency), ITF (inverse tanh frequency), and IPF (inverse proportional frequency) in this work. By combining them with TF, we first propose three unsupervised term weighting methods, denoted as TF–IHF, TF–ISF and TF–ITF, respectively. The weight enhancing effect of these schemes is plotted in Fig. 3(b). As an illustration, we assume that the total number of texts in the training set is 15 (i.e., N = 15) in Fig. 3(b), and hence the document frequency df(tj) falls into the range [1,15].

Furthermore, since df(tj) ≤ N for a given term tj, x = N/df(tj) is greater than or equal to 1. Under this condition, the first order derivative of the alternative schemes we propose is greater than 0, as illustrated in Fig. 3(c). Meanwhile, the corresponding second order derivative is less than 0, as shown in Fig. 3(d). Hence, these alternative methods satisfy the last two properties mentioned above, which makes them similar to the logarithmic ratio.

Fig. 3. Illustration of the main characteristics of several alternative methods.

Although TF is a major component of weighting methods [16], raw TF does not work well in practice, since the importance of a term cannot increase linearly as its TF increases [34]. To surmount this problem, various nonlinear transformations are also used for TF, for instance the logarithm of TF (e.g., log(tftrain(tj, dk))) in training document dk [35] and the square root of TF (i.e., sqrt(tftrain(tj, dk))). Moreover, some research results show that sqrt(tftrain(tj, dk)) is better than log(tftrain(tj, dk)) [22,27,36].

It must be noted, however, that the square root is a special case of the proportional distortion function, obtained when the distortion parameter r = 0.5. Consequently, we wonder whether applying the proportional distortion function to TF (i.e., r ∈ (0,1]) will lead to better performance than the square root (i.e., r = 0.5). This is the third question we ask in this paper.

Table 3
Summary of the nonlinear transformation (linearized as Y = A + BX).
Name                           Mathematical form        Y        X        A        B
Logarithmic function           y = a + b·log(x)         y        log(x)   a        b
Inverse exponential function   y = a·e^(b/x)            log(y)   1/x      log(a)   b
Hyperbolic function            y = x/(a + bx)           1/y      1/x      a        b
S-shape function               y = c/(a + b·e^(−x))     1/y      e^(−x)   a/c      b/c
Power function                 y = a + b·x^r            y        x^r      a        b

Table 4
The main characteristic of these alternative methods (with the parameter values a, b, c used in this study).
Logarithmic function:             fl(x) = log(x);                fl′(x) = 1/x;                        fl″(x) = −1/x²;                                 (a, b) = (0, 1)
Inverse exponential function:     fe(x) = e^(−1/x);              fe′(x) = (1/x²)·e^(−1/x);            fe″(x) = (1/x³)·e^(−1/x)·(1/x − 2);             (a, b) = (1, −1)
Hyperbolic function:              fh(x) = x/(1 + x);             fh′(x) = 1/(1 + x)²;                 fh″(x) = −2/(1 + x)³;                           (a, b) = (1, 1)
Sigmoid function:                 fs(x) = 1/(1 + e^(−x));        fs′(x) = e^(−x)/(1 + e^(−x))²;       fs″(x) = e^(−x)·(e^(−x) − 1)/(1 + e^(−x))³;     (a, b, c) = (1, 1, 1)
Tanh function:                    ft(x) = 2/(1 + e^(−2x)) − 1;   ft′(x) = 4e^(−2x)/(1 + e^(−2x))²;    ft″(x) = 8e^(−2x)·(e^(−2x) − 1)/(1 + e^(−2x))³
Proportional distortion function: fp(x) = x^r, r ∈ (0, 1.0];     fp′(x) = r·x^(r−1);                  fp″(x) = r·(r − 1)·x^(r−2);                     (a, b) = (0, 1)
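As a quick numerical sanity check of Table 4 (a sketch of our own, using central finite differences instead of the closed-form derivatives), all six functions are increasing and concave on the range x ≥ 1 that is relevant here.

```python
import math

funcs = {
    "log":        lambda x: math.log(x),
    "inv-exp":    lambda x: math.exp(-1.0 / x),
    "hyperbolic": lambda x: x / (1.0 + x),
    "sigmoid":    lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh":       lambda x: 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0,
    "power r=.5": lambda x: x ** 0.5,
}

def increasing_and_concave(f, xs, h=1e-3):
    """True if f' > 0 and f'' < 0 at every sampled point (finite differences)."""
    for x in xs:
        d1 = (f(x + h) - f(x - h)) / (2 * h)
        d2 = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
        if not (d1 > 0 and d2 < 0):
            return False
    return True

xs = [1.25 + 0.25 * i for i in range(20)]     # sample points in [1.25, 6.0]
for name, f in funcs.items():
    print(name, increasing_and_concave(f, xs))  # every scheme should print True
```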

To confirm this, both TF and IPF are transformed by the proportional distortion measure when weighting a term, and the fourth unsupervised term weighting scheme is thus proposed and named DTF–IPF, in which DTF stands for distorted term frequency. Fig. 4(a) presents the properties of the proportional distortion measure for different r. Obviously, on the two sides of the critical value x = 1 the distortion has two opposite effects. First, the distortion satisfies the mapping [0,1]→[0,1], which means that if a small value is assigned to TF or N/df(tj), its distorted value increases as r decreases. Second, the distortion satisfies the mapping [1,+∞)→[1,+∞), which means that if a large value is assigned to TF or N/df(tj), its distorted value decreases as r decreases. More precisely, for a given term tj, TF and N/df(tj) are always greater than or equal to 1. Therefore, they essentially fall under the second scenario, which means that the term tj obtains smaller distorted TF and N/df(tj) values as r decreases. Fig. 4(b) presents the contrast of the enhancing effect between IDF and IPF; it is clearly seen that their trends are basically similar to each other. Especially when r = 0.3 or 0.5, IDF and IPF are very close to each other.

Fig. 4. Proportional distortion measure and its enhancing effect.

Fig. 5 shows the proposed term weighting methods and their text classification flow. Specifically, TF–IEF is also plotted in this figure to illustrate how to represent the training and test texts, and the TF–IEF, TF–IHF, TF–ISF and TF–ITF methods can be obtained when the distortion parameter of TF is rtf = 1. It is obvious from Fig. 5 that the TF–IEF method and the alternative methods we propose in this study are very similar to the classical TF–IDF method. Therefore, the pseudo code of these methods is also similar to that of TF–IDF (i.e., Algorithm 1). For the sake of simplicity, we do not give pseudo code for these methods.

Fig. 5. Text representation methods and their text classification flow.

Now, let us recall the shortcoming of IDF(tj) illustrated in Section 2.1. For the case df(tj) = N, the comparison of the six nonlinear transformation schemes is presented in Table 5. Obviously, the four alternative schemes we propose are all non-zero constants; this property is similar to IEF [28]. In this case, the weight of a term relies solely on TF, which is also consistent with the TF–RF in [17].

Table 5
Comparison of the six nonlinear transformation schemes when df(tj) = N.
IDF = 0;  IEF = 1/e;  IHF = 0.5;  ISF = 1/(1 + e^(−1));  ITF = 2/(1 + e^(−2)) − 1;  IPF = 1

To sum up, our schemes belong to the UTW methods. Just like TF–IDF, these approaches do not require any class information when weighting a term [31]. Hence, they are easier to understand and implement than STW schemes such as TF–RF, TF–IDF–ICSDF, and so on.
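A short sketch (our own) of the DTF–IPF weight from Table 2, together with the df(tj) = N check summarised in Table 5: at x = N/df(tj) = 1 every proposed global factor is a non-zero constant, whereas IDF collapses to 0.

```python
import math

def dtf_ipf(tf, N, df, r_tf=0.5, r_ipf=0.5):
    """DTF-IPF weight from Table 2: (tf)^r_tf * (N/df)^r_ipf."""
    return (tf ** r_tf) * ((N / df) ** r_ipf)

# global factors evaluated at df(t_j) = N, i.e. x = N/df(t_j) = 1 (Table 5)
x = 1.0
factors = {
    "IDF": math.log(x),
    "IEF": math.exp(-1.0 / x),
    "IHF": x / (1.0 + x),
    "ISF": 1.0 / (1.0 + math.exp(-x)),
    "ITF": 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0,
    "IPF": x ** 0.5,
}
for name, value in factors.items():
    print(f"{name}: {value:.4f}")   # 0, 1/e, 0.5, 1/(1+e^-1), 2/(1+e^-2)-1, 1

# a term seen in every training document still gets a TF-driven, non-zero weight
print(dtf_ipf(tf=4, N=100, df=100, r_tf=0.8, r_ipf=0.01))
```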
4. Experimental settings

In this study, we use two public text classification datasets to validate the performance of our schemes, namely the Reuters-21578 and 20 Newsgroups datasets [37]. The Reuters-21578 dataset has 8 different categories, including 5485 training texts and 2189 test texts. There are 20 categories in the 20 Newsgroups corpus, including 11,293 training texts and 7528 test texts. Moreover, we omit terms that are shorter than two characters or occur fewer than two times. In addition, we apply the Porter stemmer [38] and convert all terms to lowercase letters; punctuation, stop words, numbers and other symbols are then deleted. Finally, the Reuters-21578 and 20 Newsgroups datasets have 8541 and 33,414 distinct features respectively that can be used to train the classifier.

Since most of the features are redundant or irrelevant and may harm classification performance [39,40], for all datasets CHI2 is used to select a subset of features that is considered important [41]. Furthermore, two popular machine learning approaches are used for the TC tasks, namely Multinomial Naïve Bayes (MNB) [42–44] and the linear support vector machine (SVM) [7,17,22], both with default parameters.
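The paper does not name a toolkit; the sketch below is a hypothetical scikit-learn setup (an assumption on our part) that mirrors the described pipeline: chi-square feature selection fitted on the training set, followed by MNB or a linear SVM with default parameters. In practice the term-weight matrices of Table 2 would be substituted for the toy matrices used here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def run(X_train, y_train, X_test, n_features, clf):
    """chi2 selection is fitted on training data only, then both sets are transformed."""
    selector = SelectKBest(chi2, k=n_features).fit(X_train, y_train)
    clf.fit(selector.transform(X_train), y_train)
    return clf.predict(selector.transform(X_test))

# toy non-negative term-weight matrices standing in for the real weighted corpora
X_train = np.array([[3, 0, 1], [0, 2, 1], [4, 0, 0], [0, 3, 2]], dtype=float)
y_train = np.array([0, 1, 0, 1])
X_test = np.array([[2, 0, 1], [0, 1, 3]], dtype=float)

print(run(X_train, y_train, X_test, 2, MultinomialNB()))
print(run(X_train, y_train, X_test, 2, LinearSVC()))
```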
For effectively evaluating the proposed alternative methods, macro-F1 and micro-F1 are employed [39,45]. Before defining these two measures, the F1 measure (denoted F1(ci)) should first be defined, as computed by Eq. (1). Then macro-F1 and micro-F1 [46] can be calculated using Eqs. (2) and (3), respectively.

F1(ci) = 2·TP(ci) / (2·TP(ci) + FP(ci) + FN(ci))    (1)
macro-F1 = (1/m) · Σ_{i=1..m} F1(ci)    (2)

micro-F1 = 2·Σ_{i=1..m} TP(ci) / (2·Σ_{i=1..m} TP(ci) + Σ_{i=1..m} FP(ci) + Σ_{i=1..m} FN(ci))    (3)
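Eqs. (1)–(3) translate directly into code; the following sketch (our own illustration, with made-up counts) computes both measures from per-class TP/FP/FN values.

```python
def f1(tp, fp, fn):
    """Eq. (1): per-class F1."""
    return 2 * tp / (2 * tp + fp + fn)

def macro_f1(counts):
    """Eq. (2): unweighted mean of the per-class F1 values."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)

def micro_f1(counts):
    """Eq. (3): F1 computed from globally pooled TP/FP/FN."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return 2 * tp / (2 * tp + fp + fn)

# hypothetical per-class (TP, FP, FN) counts for a toy 3-class problem
counts = [(50, 5, 10), (30, 10, 5), (8, 2, 12)]
print(round(macro_f1(counts), 4), round(micro_f1(counts), 4))
```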

5. Experimental results and analysis

In this part, an extensive experimental comparison of our schemes with existing UTW and STW schemes is performed. Moreover, all the schemes (summarized in Table 2) are normalized except for RTF-IGMimp; the reasons can be found in [22] and are not re-explained here.

5.1. Evaluating the distortion parameters of DTF–IPF

The first set of experiments is designed to determine the optimal rtf and ripf and to address the third question. For this purpose, an orthogonal experimental design is utilized to arrange the different parameter values systematically. In the experiments, the orthogonal table L25(5^6) is selected to arrange the combinations of parameters at the different levels. Macro-F1 and micro-F1 are used as response indices, and the five levels are set to 0.01, 0.3, 0.5, 0.8, and 1.0, respectively. The results for different values of the distortion parameters are plotted in Figs. 6 and 7. In both datasets, the trends of macro-F1 and micro-F1 of the DTF–IPF method are quite similar.

Fig. 6. Classification performances according to distortion parameters rtf and ripf in the Reuters-21578 dataset.

In detail, Fig. 6 presents the performances obtained on the Reuters-21578 dataset using the MNB and SVM classifiers. It should first be noticed that the performances decrease dramatically when the parameter ripf grows, regardless of the classifier used. In contrast, the performance of DTF–IPF does not depend very much on the parameter rtf. Fig. 7 depicts the performances achieved on the 20 Newsgroups dataset using the MNB and SVM classifiers. The four graphs show that the performances increase gradually as the parameter ripf grows, and the DTF–IPF scheme reaches its peak values at a small parameter ripf around 0.3; beyond that, increasing the value of ripf harms its performance, regardless of the classifier used. Moreover, it should be noted that the performance of DTF–IPF is almost fixed for different distortion parameters rtf.

Fig. 7. Classification performances according to distortion parameters rtf and ripf in the 20 Newsgroups dataset.

In general, the best classification performances are observed under rtf = 0.8 and ripf = 0.01 for the Reuters-21578 dataset, and under rtf = 0.5 and ripf = 0.3 for the 20 Newsgroups dataset. Hence, the DTF–IPF method uses these parameters in the next experiments. Importantly, these figures confirm that weighting methods using square root function-based TF can, but do not always, enhance the performance of TC. Moreover, compared with rtf, ripf has a greater impact on text classification performance.
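The parameter study can be sketched as follows (our own simplification: a full grid over the five levels stands in for the L25(5^6) orthogonal table, and `evaluate` is a placeholder for weighting the corpus with DTF–IPF, training a classifier and scoring it; the surrogate score below is made up purely to make the snippet runnable).

```python
from itertools import product

LEVELS = [0.01, 0.3, 0.5, 0.8, 1.0]

def evaluate(r_tf, r_ipf):
    """Placeholder: would return (macro_F1, micro_F1) of DTF-IPF with these parameters."""
    score = 1.0 - abs(r_tf - 0.8) * 0.1 - r_ipf * 0.2   # toy surrogate, not real results
    return score, score

best = max(product(LEVELS, LEVELS), key=lambda p: evaluate(*p)[0])
print("best (r_tf, r_ipf) by macro-F1:", best)
```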

Fig. 8. Boxplots showing the performance of UTW schemes on Reuters-21578 dataset.

5.2. Performance comparisons of several alternative methods

The second set of experiments is designed primarily to address the first question and to verify whether the performance of the proposed UTW schemes is better than that of TF–IDF. We present the results with boxplots, as shown in Figs. 8 and 9. Here, the horizontal axis lists the different UTW schemes, and the vertical axis displays the performance values.

From Fig. 8, although there are some obvious outliers, we can note that the differences in both macro-F1 and micro-F1 are statistically significant. Most importantly, the scores of TF–IEF and of our proposed approaches outperform TF–IDF substantially and significantly, regardless of the classifier used. In fact, we observe that the median values of TF–IEF and of our proposed methods are far larger than the maximum value of TF–IDF.

It is interesting that the performance gaps among the six approaches illustrated in Fig. 8 are more explicit than those illustrated in Fig. 9. This may be because the datasets have different characteristics, e.g., the 20 Newsgroups dataset is a nearly balanced corpus. For almost the same reason, on the 20 Newsgroups dataset the difference between macro-F1 and micro-F1 is also not notable. In particular, the classification performances of the TF–IEF, TF–IHF, TF–ISF, and TF–ITF schemes are very close to each other. However, regardless of the classifier used, the maximum and median values of our proposed DTF–IPF method are always higher than the corresponding results of the TF–IDF method for both macro-F1 and micro-F1. Comparing TF–IDF and DTF–IPF, it is clear that the TF–IDF method achieves the minimum value. Note that there are no outliers in these data. Moreover, DTF–IPF is superior to TF–IEF and to the other methods we have proposed, i.e., TF–IHF, TF–ISF, and TF–ITF.

Overall, based on these observations, it can be concluded that the alternative methods we propose are suitable for weighting terms. More importantly, the classification performance of the UTW methods generated by them improves substantially over the famous TF–IDF.

Fig. 9. Boxplots showing the performance of UTW schemes on 20 Newsgroups dataset.

5.3. Performance results of existing STW schemes

The third set of experiments is conducted to address the second question, which concerns how to properly represent test texts to enhance the performance of TC. For comparison, we also show two other STW methods (i.e., TF-IG and TF-OR) in addition to the three STW methods mentioned above (i.e., TF–RF, TF–IDF–ICSDF and RTF-IGMimp). The results are given in Tables 6 and 7. Note that the numbers in parentheses represent the performance when the test text is represented by the TF–IDF method. For example, for the TF–IDF–ICSDF method in the fifth column, the numbers in parentheses indicate that the training text is represented using TF–IDF–ICSDF while the test text is represented by TF–IDF; otherwise, both the training and test texts are represented by the TF–IDF–ICSDF method.

For the TF–RF scheme, we can easily find that the TF–RF method is obviously superior to the TF-IG and TF-OR methods when the SVM classifier is run, regardless of the dataset used. This is consistent with previous observations [17].
For the MNB classifier, the performance of TF–RF is roughly higher than that of TF-IG and TF-OR on the Reuters-21578 dataset, while it outperforms the TF-IG and TF-OR methods significantly on the 20 Newsgroups dataset. This finding suggests that it is feasible to use TF–IDF to represent the test text for the STW scheme TF–RF; the details of the TF–RF scheme were already discussed in Section 2.2.1.

For the TF–IDF–ICSDF scheme, we can easily find that the number in parentheses is greater than the number outside the parentheses (as shown in Table 6), which means that TF–IDF–ICSDF gains its best performance with the MNB and SVM classifiers on the Reuters-21578 dataset when the training text and test text are represented by the TF–IDF–ICSDF and TF–IDF methods, respectively. In contrast, for the 20 Newsgroups dataset, when the TF–IDF–ICSDF method is used for both training text and test text representation, the TF–IDF–ICSDF scheme is slightly better than the other case. This can be explained by the fact that the ICSDF and IDF factors are not very different under a uniform class distribution when calculating the weight of terms, and hence this result supports our previous observation in Section 2.2.2.

For the RTF-IGMimp scheme, we also observe that RTF-IGMimp obtains significantly higher classification performance than all other STW schemes, regardless of the classifier or dataset used. This happens when the RTF-IGMimp scheme is used to represent the training text and test text simultaneously in the weighting process, that is to say, when the representation of the training text and the test text is consistent when computing a given term. This result shows that our proposed test text representation method is effective, even though it is not clear how to represent the test text in [27]. Note that this result is also in accordance with previous work [27], namely, the performance of RTF-IGMimp is better than that of other STW methods.

5.4. Overall performances and statistical significance tests

We wish to identify whether the proposed UTW methods perform better than the existing STW methods, and meanwhile we also want to know which schemes are better than the existing ones. We therefore summarize the data from the above experiments. Figs. 10 and 11 plot the final results of our experiments, where each scheme achieves its best classification performance under a certain number of features. More precisely, the MNB and SVM classifiers perform text classification on the Reuters-21578 dataset with 800 and 6000 selected features respectively, and on the 20 Newsgroups dataset with 16,000 selected features for both classifiers.

For the Reuters-21578 dataset, we can easily find that TF–IEF and our proposed schemes significantly outperform the classic TF–IDF scheme for all classifiers, and surpass all STW schemes when the SVM classifier is run. For the 20 Newsgroups dataset, DTF–IPF is better than TF–IDF regardless of the classifier used. In addition, it significantly outperforms all STW schemes, including RTF-IGMimp, for the SVM classifier. Moreover, comparing Tables 6 and 7, it is quite interesting that RTF-IGMimp shows its superiority in both datasets when MNB is used as the classifier rather than SVM. Most importantly, these results once again support the previous finding described above, that is, not all STW schemes always perform better than UTW schemes [17,20,26,28]. This may be due to the nature of the TC tasks, which suggests that the performance of term weighting methods differs considerably depending mainly on the characteristics of the dataset and the classifier used.

Table 6
The comparison of performance of STW schemes in the Reuters-21578 dataset using (a) MNB (b) SVM.
Classifier, #Features Macro-F1 Micro-F1
TF-IG TF-OR TF–RF TF–IDF–ICSDF RTF-IGMimp TF-IG TF-OR TF–RF TF–IDF–ICSDF RTF-IGMimp
(a) MNB:
200 0.780 0.829 0.779 0.459 (0.597) 0.888 (0.777) 0.861 0.917 0.893 0.693 (0.821) 0.941 (0.893)
600 0.849 0.835 0.846 0.648 (0.741) 0.902 (0.793) 0.935 0.934 0.946 0.846 (0.909) 0.960 (0.909)
1000 0.846 0.823 0.820 0.695 (0.757) 0.891 (0.786) 0.943 0.933 0.948 0.872 (0.924) 0.959 (0.909)
2000 0.837 0.803 0.813 0.679 (0.732) 0.882 (0.776) 0.942 0.931 0.947 0.885 (0.922) 0.957 (0.906)
3000 0.788 0.799 0.800 0.644 (0.688) 0.880 (0.774) 0.939 0.929 0.946 0.878 (0.910) 0.956 (0.905)
4000 0.756 0.758 0.775 0.637 (0.667) 0.891 (0.780) 0.934 0.931 0.943 0.879 (0.903) 0.959 (0.905)
5000 0.734 0.757 0.767 0.600 (0.663) 0.892 (0.762) 0.926 0.931 0.941 0.872 (0.899) 0.960 (0.899)
6000 0.696 0.758 0.725 0.597 (0.627) 0.894 (0.770) 0.921 0.931 0.934 0.872 (0.889) 0.960 (0.901)
7000 0.672 0.753 0.706 0.590 (0.614) 0.900 (0.763) 0.914 0.931 0.928 0.866 (0.882) 0.959 (0.899)
8000 0.643 0.746 0.688 0.576 (0.602) 0.898 (0.757) 0.905 0.929 0.921 0.858 (0.883) 0.958 (0.897)
(b) SVM:
200 0.479 0.746 0.840 0.833 (0.877) 0.902 (0.822) 0.615 0.769 0.853 0.910 (0.925) 0.942 (0.892)
600 0.638 0.825 0.881 0.881 (0.905) 0.903 (0.835) 0.849 0.890 0.939 0.951 (0.960) 0.957 (0.892)
1000 0.703 0.844 0.891 0.883 (0.903) 0.914 (0.852) 0.865 0.908 0.946 0.953 (0.960) 0.961 (0.893)
2000 0.721 0.847 0.913 0.890 (0.917) 0.907 (0.848) 0.872 0.915 0.953 0.957 (0.962) 0.958 (0.910)
3000 0.722 0.854 0.911 0.890 (0.912) 0.917 (0.852) 0.872 0.919 0.953 0.957 (0.963) 0.966 (0.907)
4000 0.722 0.855 0.924 0.892 (0.922) 0.913 (0.864) 0.873 0.921 0.959 0.958 (0.964) 0.967 (0.924)
5000 0.724 0.855 0.917 0.896 (0.921) 0.925 (0.877) 0.876 0.910 0.952 0.959 (0.965) 0.970 (0.928)
6000 0.724 0.854 0.922 0.897 (0.926) 0.925 (0.874) 0.876 0.909 0.953 0.959 (0.966) 0.971 (0.927)
7000 0.724 0.853 0.922 0.897 (0.927) 0.924 (0.876) 0.876 0.908 0.954 0.958 (0.967) 0.970 (0.928)
8000 0.724 0.853 0.922 0.897 (0.926) 0.926 (0.879) 0.876 0.908 0.953 0.958 (0.965) 0.970 (0.931)

Table 7
The comparison of performance of STW schemes in the 20 Newsgroups dataset using (a) MNB (b) SVM.
Classifier, #Features Macro-F1 Micro-F1
TF-IG TF-OR TF–RF TF–IDF–ICSDF RTF-IGMimp TF-IG TF-OR TF–RF TF–IDF–ICSDF RTF-IGMimp
(a) MNB:
200 0.587 0.594 0.599 0.585 (0.589) 0.636 (0.620) 0.600 0.605 0.615 0.606 (0.608) 0.639 (0.621)
600 0.669 0.674 0.693 0.678 (0.684) 0.722 (0.702) 0.680 0.689 0.710 0.701 (0.705) 0.730 (0.707)
1000 0.701 0.709 0.733 0.723 (0.723) 0.752 (0.735) 0.712 0.725 0.749 0.743 (0.743) 0.760 (0.740)
2000 0.721 0.739 0.763 0.754 (0.753) 0.780 (0.762) 0.731 0.752 0.776 0.771 (0.770) 0.787 (0.768)
4000 0.736 0.760 0.785 0.776 (0.777) 0.808 (0.790) 0.741 0.772 0.797 0.792 (0.793) 0.815 (0.794)
6000 0.740 0.761 0.790 0.787 (0.782) 0.818 (0.799) 0.749 0.774 0.802 0.802 (0.798) 0.825 (0.803)
8000 0.741 0.767 0.798 0.794 (0.790) 0.824 (0.807) 0.750 0.779 0.810 0.809 (0.806) 0.831 (0.812)
10 000 0.742 0.770 0.801 0.798 (0.793) 0.829 (0.810) 0.751 0.783 0.813 0.813 (0.808) 0.835 (0.815)
12 000 0.742 0.773 0.804 0.801 (0.795) 0.830 (0.812) 0.752 0.787 0.817 0.816 (0.811) 0.837 (0.818)
14 000 0.741 0.775 0.805 0.802 (0.796) 0.832 (0.818) 0.751 0.789 0.817 0.817 (0.812) 0.838 (0.823)
16 000 0.741 0.777 0.806 0.803 (0.797) 0.834 (0.817) 0.751 0.790 0.819 0.818 (0.813) 0.841 (0.822)
(b) SVM:
200 0.315 0.476 0.536 0.542 (0.584) 0.644 (0.601) 0.259 0.410 0.478 0.489 (0.551) 0.635 (0.592)
600 0.422 0.568 0.637 0.654 (0.678) 0.689 (0.599) 0.352 0.510 0.594 0.617 (0.661) 0.692 (0.600)
1000 0.474 0.616 0.686 0.706 (0.725) 0.698 (0.603) 0.408 0.570 0.661 0.683 (0.718) 0.702 (0.606)
2000 0.547 0.657 0.723 0.746 (0.755) 0.715 (0.609) 0.492 0.619 0.709 0.735 (0.755) 0.721 (0.614)
4000 0.615 0.699 0.757 0.784 (0.787) 0.740 (0.636) 0.591 0.679 0.753 0.784 (0.791) 0.744 (0.640)
6000 0.636 0.711 0.766 0.797 (0.796) 0.748 (0.658) 0.623 0.695 0.765 0.799 (0.801) 0.752 (0.662)
8000 0.646 0.723 0.775 0.804 (0.805) 0.757 (0.675) 0.640 0.714 0.776 0.809 (0.811) 0.761 (0.679)
10 000 0.649 0.733 0.781 0.812 (0.808) 0.761 (0.679) 0.645 0.727 0.782 0.817 (0.815) 0.764 (0.682)
12 000 0.654 0.742 0.787 0.816 (0.813) 0.767 (0.694) 0.651 0.739 0.789 0.821 (0.819) 0.770 (0.698)
14 000 0.655 0.744 0.789 0.819 (0.814) 0.767 (0.697) 0.653 0.742 0.792 0.824 (0.821) 0.770 (0.700)
16 000 0.656 0.746 0.791 0.820 (0.816) 0.770 (0.699) 0.654 0.744 0.794 0.825 (0.822) 0.773 (0.703)

For the statistical significance tests, the paired-sample t-test is used to test the performance differences in this study. The results of the SVM classifier are selected for this test, mainly because the performances achieved by SVM are superior to those of the MNB classifier for both datasets in general. Table 8 gives the results of the two-tailed test, where T-crit stands for the critical value of T. The symbols '*' and '**' mark statistical significance at the 0.01 and 0.05 levels, respectively.

For the Reuters-21578 dataset, we can see from Table 8 that the difference in performance (i.e., macro-F1 and micro-F1) between DTF–IPF and the other schemes is statistically significant at the 0.01 level in most cases. For the 20 Newsgroups dataset, we should note that except for the difference between DTF–IPF and RTF-IGMimp, which is statistically significant at the 0.05 level, the other pairs are statistically significant at the 0.01 level. Therefore, the above results verify the superiority of our proposed methods. Meanwhile, we are convinced that at least one of our proposed UTW methods (e.g., DTF–IPF) performs better than the famous TF–IDF for all datasets and classifiers, and sometimes even outperforms the STW schemes. This indicates that, in addition to the logarithmic and inverse exponential transformation methods, all the alternative methods we propose in this study are effective.
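The test in Table 8 can be reproduced with a paired-sample t-test over the per-feature-size scores of two schemes. The sketch below assumes SciPy is available (an assumption on our part) and uses made-up score vectors purely for illustration.

```python
from scipy import stats

# hypothetical macro-F1 scores of two schemes over the same feature-size settings
dtf_ipf = [0.905, 0.912, 0.918, 0.921, 0.923, 0.924, 0.925, 0.925, 0.926, 0.926]
tf_idf = [0.842, 0.861, 0.870, 0.874, 0.876, 0.877, 0.878, 0.878, 0.879, 0.879]

t_stat, p_value = stats.ttest_rel(dtf_ipf, tf_idf)   # two-tailed paired t-test
print(f"T-value = {t_stat:.4f}, P-value = {p_value:.2e}")
```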

Table 8
Results of paired-sample t-test.
Pair Metrics Reuters-21578 20 Newsgroups
T-value T-crit P-value T-value T-crit P-value
DTF–IPF vs. TF–IDF Macro-F1 6.736405 3.249836 8.49E−05∗ 11.436813 3.169273 4.59E−07∗
Micro-F1 6.550742 3.249836 0.000105∗ 7.099215 3.169273 3.30E−05∗
TF–IEF Macro-F1 2.385080 2.262157 0.040886∗∗ 6.689453 3.169273 5.44E−05∗
Micro-F1 0.452267 2.262157 0.661780 3.772395 3.169273 0.003647∗
TF–IHF Macro-F1 3.410414 3.249836 0.007745∗ 6.832566 3.169273 4.55E−05∗
Micro-F1 1.094095 2.262157 0.302329 3.836296 3.169273 0.003285∗
TF–ISF Macro-F1 5.025502 3.249836 0.000714∗ 7.609999 3.169273 1.82E−05∗
Micro-F1 3.077935 2.262157 0.013184∗∗ 4.327321 3.169273 0.001496∗
TF–ITF Macro-F1 3.457717 3.249836 0.007187∗ 7.664177 3.169273 1.71E−05∗
Micro-F1 1.860521 2.262157 0.095734 4.364943 3.169273 0.001410∗
TF–RF Macro-F1 5.257691 3.249836 0.000522∗ 20.710742 3.169273 1.52E−09∗
Micro-F1 4.205439 3.249836 0.002288∗ 12.57071 3.169273 1.89E−07∗
TF–IDF–ICSDF Macro-F1 21.31774 3.249836 5.17E−09∗ 4.586247 3.169273 0.001000∗
Micro-F1 16.39843 3.249836 5.19E−08∗ 3.961653 3.169273 0.002679∗
RTF-IGMimp Macro-F1 4.761036 3.249836 0.001028∗ 4.180508 3.169273 0.001886∗
Micro-F1 2.643312 2.262157 0.026768∗∗ 2.598144 2.228139 0.026576∗∗

Fig. 10. Performance comparison on the Reuters-21578 dataset.

Fig. 11. Performance comparison on the 20 Newsgroups dataset.

6. Conclusions

The overall purpose of this work is to resolve the three questions raised in this paper, and they have been successfully addressed through three groups of experiments on two benchmark datasets with the MNB and SVM classifiers. Concerning the first question, we introduced four nonlinear transformation methods as global factors in UTW schemes, just like the IDF in TF–IDF, which adopts a logarithmic transformation. Comparison of the proposed schemes with the popular TF–IDF indicates that our methods obtain competitive performance, and our DTF–IPF scheme consistently shows the best performance among these UTW methods. Regarding the second question, we formally explained how to represent test texts, especially for the STW methods. As an illustration, a classic UTW method (i.e., TF–IDF) and three typical STW approaches (i.e., TF–RF, TF–IDF–ICSDF, and RTF-IGMimp) were analyzed with respect to how test texts are represented. In the TF–IDF and TF–RF methods, there is only one way to represent the test text, namely the TF–IDF method. However, there are two ways to represent the test text in the TF–IDF–ICSDF and RTF-IGMimp methods: either the scheme is used to represent both the training text and the test text, or it is used only for the training text while the test text still uses the TF–IDF method. The effectiveness of these test text representation methods is demonstrated by comparison with previous findings. Comparisons also show that our UTW schemes are sometimes better than the STW schemes. For the third question, we evaluated the effect of different distortion parameters rtf and ripf; the results show that using the square root of TF may not greatly improve the performance of TC, while the proportional distortion function can lead to better performance than the square root of TF.

CRediT authorship contribution statement

Zhong Tang: Conceptualization, Methodology, Software, Writing - original draft. Wenqiang Li: Funding acquisition, Data curation. Yan Li: Validation, Supervision. Wu Zhao: Funding acquisition, Investigation. Song Li: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program, China (No. 2018YFB1700702), the Science & Technology Ministry Innovation Method Program, China (No. 2017IM040100), the Sichuan Major Science and Technology Project, China (No. 2019ZDZX0001), and the Sichuan Applied Foundation Project, China (No. 2018JY0119).

References

[1] H. Al-Mubaid, S.A. Umair, A new text categorization technique using distributional clustering and learning logic, IEEE Trans. Knowl. Data Eng. 18 (9) (2006) 1156–1165.
[2] Z. Erenel, H. Altınçay, Nonlinear transformation of term frequencies for term weighting in text categorization, Eng. Appl. Artif. Intell. 25 (7) (2012) 1505–1514.
[3] C.X. Shang, M. Li, S.Z. Feng, et al., Feature selection via maximizing global information gain for text classification, Knowl.-Based Syst. 54 (2013) 298–309.
[4] E.S. Tellez, D. Moctezuma, S. Miranda-Jiménez, et al., An automated text categorization framework based on hyperparameter optimization, Knowl.-Based Syst. 149 (2018) 110–123.
[5] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. 34 (1) (2002) 1–47.
[6] B. Tang, S. Kay, H.B. He, Toward optimal feature selection in Naive Bayes for text categorization, IEEE Trans. Knowl. Data Eng. 28 (9) (2016) 2508–2521.
[7] M. Haddoud, A. Mokhtari, T. Lecroq, et al., Combining supervised term-weighting metrics for SVM text classification with extended term representation, Knowl. Inf. Syst. 49 (3) (2016) 909–931.
[8] Z.C. Li, Z.Y. Xiong, Y.F. Zhang, et al., Fast text categorization using concise semantic analysis, Pattern Recognit. Lett. 32 (3) (2011) 441–448.
[9] D.Q. Wang, H. Zhang, Inverse-category-frequency based supervised term weighting schemes for text categorization, J. Inf. Sci. Eng. 29 (2) (2013) 209–225.
[10] G. Salton, A. Wong, C.S. Yang, A vector space model for automatic indexing, Commun. ACM 18 (11) (1974) 613–620.
[11] M. Melucci, Vector-space model, in: L. Liu, M.T. Özsu (Eds.), Encyclopedia of Database Systems, Springer US, Boston, MA, 2009, pp. 3259–3263.
[12] Ş. Taşcı, T. Güngör, Comparison of text feature selection policies and using an adaptive framework, Expert Syst. Appl. 40 (12) (2013) 4871–4886.
[13] H.T. Nguyen, P.H. Duong, E. Cambria, Learning short-text semantic similarity with word embeddings and external knowledge sources, Knowl.-Based Syst. 182 (2019) 104842.
[14] F.J. Ren, M.G. Sohrab, Class-indexing-based term weighting for automatic text classification, Inform. Sci. 236 (1) (2013) 109–125.
[15] D.S. Guru, M. Suhil, L.N. Raju, et al., An alternative framework for univariate filter based feature selection for text categorization, Pattern Recognit. Lett. 103 (2018) 23–31.
[16] I. Alsmadi, G.K. Hoon, Term weighting scheme for short-text classification: Twitter corpuses, Neural Comput. Appl. 31 (8) (2019) 3819–3831.
[17] M. Lan, C.L. Tan, J. Su, et al., Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 721–735.
[18] F. Debole, F. Sebastiani, Supervised term weighting for automated text categorization, in: Proceedings of the ACM Symposium on Applied Computing, 2003, pp. 784–788.
[19] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1) (1972) 11–21.
[20] H.B. Wu, X.D. Gu, Y.W. Gu, Balancing between over-weighting and under-weighting in supervised term weighting, Inf. Process. Manage. 53 (2) (2017) 547–557.
[21] H.J. Escalante, M.A. García-Limón, A. Morales-Reyes, et al., Term-weighting learning via genetic programming for text classification, Knowl.-Based Syst. 83 (2015) 176–189.
[22] K.W. Chen, Z.P. Zhang, J. Long, et al., Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Syst. Appl. 66 (33) (2016) 245–260.
[23] H. Altınçay, Z. Erenel, Analytical evaluation of term weighting schemes for text categorization, Pattern Recognit. Lett. 31 (11) (2010) 1310–1323.
[24] Y. Liu, H.T. Loh, A. Sun, Imbalanced text classification: A term weighting approach, Expert Syst. Appl. 36 (1) (2009) 690–701.
[25] V. Lertnattee, T. Theeramunkong, Analysis of inverse class frequency in centroid-based text classification, in: Proceedings of the 4th International Symposium on Communication and Information Technology, 2004, pp. 1171–1176.
[26] X.J. Quan, W.Y. Liu, B.T. Qiu, Term weighting schemes for question categorization, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2011) 1009–1021.
[27] T. Dogan, A.K. Uysal, Improved inverse gravity moment term weighting for text classification, Expert Syst. Appl. 130 (2019) 45–59.
[28] Z. Tang, W.Q. Li, Y. Li, An improved term weighting scheme for text classification, Concurr. Comput.: Pract. Exper. 32 (2020) e5604.
[29] S. Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, J. Doc. 60 (5) (2004) 503–520.
[30] T. Wang, Y. Cai, H.F. Leung, et al., Entropy-based term weighting schemes for text categorization in VSM, in: Proceedings of the 27th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2015.
[31] Y. Ko, A new term-weighting scheme for text classification using the odds of positive and negative class probabilities, J. Assoc. Inf. Sci. Technol. 66 (12) (2015) 2553–2565.
[32] R.A. Sinoara, J. Camacho-Collados, R.G. Rossi, et al., Knowledge-enhanced document embeddings for text classification, Knowl.-Based Syst. 163 (2019) 955–971.
[33] S. Wang, Insurance pricing and increased limits ratemaking by proportional hazards transforms, Insurance Math. Econom. 17 (1) (1995) 43–54.
[34] J.H. Paik, A novel TF-IDF weighting scheme for effective ranking, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2013, pp. 343–352.
[35] C. Buckley, G. Salton, J. Allan, et al., Automatic query expansion using SMART: TREC3, in: Proceedings of the Third Text Retrieval Conference, pp. 69–80.
[36] T. Dogan, A.K. Uysal, On term frequency factor in supervised term weighting schemes for text classification, Arab. J. Sci. Eng. 44 (2019) 9545–9560.
[37] A. Cardoso-Cachopo, Improving Methods for Single-Label Text Categorization (Ph.D. thesis), Instituto Superior Técnico-Universidade Técnica de Lisboa, Portugal, 2007.
[38] M.F. Porter, An algorithm for suffix stripping, Program 40 (3) (2006) 211–218.
[39] J.N. Meng, H.F. Lin, Y.H. Yu, A two-stage feature selection method for text categorization, Comput. Math. Appl. 62 (7) (2011) 2793–2800.
[40] S.P. Wang, W. Pedrycz, Q.X. Zhu, et al., Subspace learning for unsupervised feature selection via matrix factorization, Pattern Recognit. 48 (1) (2015) 10–19.
[41] Y.M. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: Proceedings of the 14th International Conference on Machine Learning, Nashville, USA, 1997, pp. 412–420.
[42] A. McCallum, K. Nigam, A comparison of event models for naive Bayes text classification, in: Proceedings of the AAAI/ICML Workshop on Learning for Text Categorization, AAAI Press, 1998, pp. 41–48.
[43] S.B. Kim, K.S. Han, H.C. Rim, et al., Some effective techniques for Naive Bayes text classification, IEEE Trans. Knowl. Data Eng. 18 (11) (2006) 1457–1466.
[44] M. Labani, P. Moradi, F. Ahmadizar, et al., A novel multivariate filter method for feature selection in text classification problems, Eng. Appl. Artif. Intell. 70 (2018) 25–37.
[45] Y.M. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1 (1–2) (1999) 69–90.
[46] T. Sabbah, A. Selamat, M.H. Selamat, et al., Modified frequency-based term weighting schemes for text classification, Appl. Soft Comput. 58 (2017) 193–206.
