
International Journal of Computational Intelligence Systems (2023) 16:161

https://doi.org/10.1007/s44196-023-00337-z

RESEARCH ARTICLE

Transformer and Graph Convolutional Network for Text Classification


Boting Liu1 · Weili Guan2 · Changjin Yang1 · Zhijie Fang3 · Zhiheng Lu4

Received: 27 February 2023 / Accepted: 8 September 2023


© The Author(s) 2023

Abstract
Graph convolutional network (GCN) is an effective tool for feature clustering. However, in the text classification task, the traditional TextGCN (GCN for text classification) ignores the contextual word order of the text. In addition, TextGCN constructs the text graph only according to contextual relationships, so it is difficult for the word nodes to learn an effective semantic representation. Based on this, this paper proposes a text classification method that combines the Transformer and GCN. To improve the semantic accuracy of word node features, we add part-of-speech (POS) information to the word-document graph and build edges between words based on POS. Between the GCN layers, the Transformer is used to extract the contextual and sequential information of the text. We conducted experiments on five representative datasets. The results show that our method effectively improves the accuracy of text classification and outperforms the comparison methods.

Keywords Graph convolutional network · Text classification · Part of speech · Transformer

Weili Guan (corresponding author)
2113391032@st.gxu.edu.cn

Zhiheng Lu (corresponding author)
8812316@163.com

Boting Liu
971328422@qq.com

Changjin Yang
503141588@qq.com

Zhijie Fang
nnfang@163.com

1 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, Guangxi, China
2 College of Digital Economics, Nanning University, Nanning 530299, Guangxi, China
3 College of Electrical Engineering, Guangxi University of Science and Technology, Liuzhou 545006, Guangxi, China
4 School of Mechanical Engineering, Guangxi University, Nanning 530004, Guangxi, China

1 Introduction

Text classification plays a pivotal role in natural language processing (NLP) [1]. It involves the automated categorization of text through computer technology, finding extensive application in sentiment analysis, document classification, and public opinion analysis, among other domains. Elevating the precision of text classification tasks holds the key to resolving pertinent real-world issues and enhancing the overall quality of life. Hence, the essence of this paper's research lies in refining existing text classification techniques, focusing on improving the accuracy of text classification.
Existing text classification technologies are mainly based on machine learning and deep learning methods [2]. Common machine learning methods in text classification tasks include Support Vector Machine (SVM) [3], K-Nearest Neighbor (KNN) [4], and Random Forest (RF) [5], which have achieved excellent performance in simple classification tasks [5]. However, machine learning methods based on statistical techniques have difficulty in achieving the desired performance on complex tasks in real life, such as medical text diagnosis and sentiment analysis. With the breakthrough development of word vector technology in deep learning, words are given contextual semantics in the form of vectors [6]. Deep learning methods based on word vectors have achieved excellent NLP results and have gradually become the mainstream method in the field of NLP [1]. In the field of text classification, the most commonly used deep learning methods are Convolutional Neural Network (CNN) [7], Recurrent Neural Network (RNN) [8], Transformer [9], and Graph Convolutional Network (GCN) [10]. Due to the limitation of CNN convolutional kernel size, CNN focuses more on extracting local feature information of text. RNN takes into account the role of each token in text, so RNN focuses more on extracting global feature information of text, but there is a risk of gradient disappearance in RNN.


Transformer is a powerful feature selection tool that combines attention mechanisms to obtain stronger contextual relationships for text, and is an innovative technique in NLP. TextGCN achieves text classification by constructing a word-document graph structure, and GCN focuses more on the spatial feature information of the text. Also, in the study [10], GCN achieves better text classification accuracy than CNN and RNN. Therefore, we believe that GCN has great potential for text classification tasks. We will combine the Transformer to improve GCN and obtain higher performance for text classification.
Nodes in the GCN simultaneously assimilate information from their neighboring nodes. However, this approach implies that in a text classification task, document nodes consider all words within the document simultaneously, disregarding the text's sequential order. Varied sentence structures convey nuanced meanings, underscoring the significance of preserving text order. Consequently, we posit that enhancing GCN's efficacy in text classification necessitates imbibing knowledge about text sequences. Moreover, the scope of semantics attainable solely through contextual relationships in word-document graphs is inherently limited. Building upon this premise, our paper introduces a novel text classification approach that amalgamates Transformer and GCN. This fusion capitalizes on the strengths of both models. The principal contributions of our study encompass the following aspects:

• To tackle the issue of GCN overlooking textual order, we seamlessly integrate the Transformer into the graph convolutional layers, forming what we refer to as a Graph Convolution Layer-Transformer-Graph Convolution Layer (GTG). The Transformer enhances the contextualization of textual information, considering the crucial textual order aspect. The resultant Transformer output is amalgamated with GCN to yield a more precise semantic representation of document nodes.
• To address the issue of limited semantic information in word node vectors within GCN, we suggest constructing word-document graphs based on POS tagging. This approach imbues words with POS-related semantics, thereby enhancing the overall semantic quality of word node vectors.

2 Related Work

2.1 TextGCN

In the early days, GCN was mainly applied to tasks with obvious spatial structure, such as social networks and knowledge graphs. In 2019, Yao et al. [10] applied GCN to text classification tasks for the first time and achieved good document classification performance. TextGCN, a model based on semi-supervised learning, reduces the training difficulty to a certain extent. This enables TextGCN to achieve good fitting performance through only two layers of graph convolution. Moreover, in TextGCN, documents and words form a heterogeneous graph structure, allowing for the learning of information at the word and document levels. Since then, more and more researchers have applied GCN to text classification [11].

2.2 Recent Works

The proposal [12] proposes a document-level GCN-based text classification method. Unlike TextGCN, the proposal [12] constructs each document as a separate graph. The computational cost of GCN is optimized to achieve better classification performance than TextGCN and to support online classification of documents. A text classification method based on a text graph tensor is proposed in the proposal [13]. The proposal [13] uses three different compositions, semantic, syntactic, and sequential, to coordinate the information between different types of graphs and achieve better classification performance than TextGCN. A text classification method based on GCN with Bidirectional Long Short-Term Memory (BiLSTM) is proposed in the proposal [14], which is called IMGCN. The proposal [14] used WordNet [15] with a syntactic dependency composition method and used BERT to get the embedding representation of word nodes. A bidirectional LSTM with attention was used to further extract the contextual relationships of the text and was combined with residual connections to get the classification results. A text classification method combining BiGRU and GCN is proposed in the proposal [16]. The word embedding representation is obtained by Word2vec [17], the contextual information of the text is extracted by a Bidirectional Gated Recurrent Unit (BiGRU) [18], and the spatial information of the text is extracted by the GCN. A short text classification method based on GCN and BERT [19] was proposed in the proposal [20]. A word-document-topic graph structure was constructed using the Biterm Topic Model (BTM) [21] to obtain the topics of documents. The word node features after GCN iteration are fused with the word features output from BERT and input to a BiLSTM. The BiLSTM extracts the contextual semantics of the text and is finally fused with the document node features to get the classification results. The proposal [22] proposed a text classification model based on BERT with GCN. They initialize the node vectors of GCN by BERT and jointly train GCN and BERT to fully utilize the advantages of each model. A GCN text classification method based on inductive graphs was proposed in the proposal [23]. The original dataset was statistically summarized into small graphs, and good classification results were obtained based on the small graphs alone.
In Table 1, we have briefly described the highlights and limitations of related work.

Table 1 Comparison between related work

Method | Year | Highlights | Limitations
TextGCN [10] | 2019 | The concept of the word-document graph was proposed | The position information of the word nodes was ignored
Text-level GCN [12] | 2019 | Text-level graphs were constructed to optimize the computational cost | The position information of the word nodes was ignored
Tensor GCN [13] | 2020 | A multi-angle graph-building idea was proposed | The position information of the word nodes was ignored
BiLSTM+GCN [20] | 2020 | The topic of the text was obtained through BTM and used as a graph node | The position information of the word nodes was ignored
BERT+GCN [22] | 2021 | Initialization of the word node embedding representations by BERT | The position information of the word nodes was ignored
IMGCN [14] | 2022 | A graph structure of dependencies and a semantic dictionary is introduced via WordNet | The position information of the word nodes was ignored
BiGRU+GCN [16] | 2022 | A hybrid structure of BiGRU and GCN is proposed | The position information of the word nodes was ignored

All of the aforementioned studies have built upon the foundation laid by TextGCN [10]. They have integrated additional networks or utilized diverse configurations as their primary focus, aligning closely with the direction of this research. Nevertheless, as indicated in Table 1, none of these approaches appear to have addressed the issue of textual ordering. Therefore, this paper aims to rectify the limitations of GCN concerning the aspect of text sequence.

3 Transformer and Graph Convolutional Network for Text Classification

3.1 Method Structure

The text classification method based on Transformer and GCN includes data pre-processing and GTG; the model structure is shown in Fig. 1.

Fig. 1 The text classification structure based on Transformer and GCN. In this figure, the term "doc" denotes a document, while "text" refers to the textual content within the dataset

Data pre-processing The initial step involves eliminating irrelevant words from the dataset, such as adverbs and adjectives, by referencing the list of stop words. In alignment with TextGCN [10], the identical stop words list is employed. Subsequently, the construction of the word-document global co-occurrence graph is rooted in contextual relationships. Further elaboration on this process can be found in Sect. 3.2.
GTG After constructing a word-document graph, the graph node features undergo initial updates following the application of the first graph convolution layer (GCL). Subsequently, the word nodes are input into the Transformer to extract contextual semantics, along with the text's semantic order information. Ultimately, the Transformer's output is integrated with the document nodes to augment features, forming the input for the second GCL.
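To make this two-stage flow concrete, the following is a minimal PyTorch sketch of a GTG-style forward pass. It is an illustration under our own assumptions rather than the authors' released implementation: the module names are hypothetical, and a mean over the encoder's output tokens stands in for the concatenation, linear projection, and Mish fusion detailed in Eqs. (7), (8), (12), and (13) below.

```python
import torch
import torch.nn as nn

class GraphConvolution(nn.Module):
    """One graph convolution layer: relu(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, a_hat, x):
        return torch.relu(a_hat @ x @ self.weight)

class GTG(nn.Module):
    """Sketch of the pipeline: first GCL -> Transformer encoder -> fuse -> second GCL."""
    def __init__(self, in_dim, hidden_dim, num_classes, nhead=4):
        super().__init__()
        self.gcl1 = GraphConvolution(in_dim, hidden_dim)
        self.encoder = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=nhead,
                                                  batch_first=True)
        self.gcl2 = GraphConvolution(hidden_dim, num_classes)

    def forward(self, a_hat, x, doc_idx, word_seqs):
        # a_hat: normalized adjacency; x: node features;
        # doc_idx: indices of document nodes; word_seqs: per-document word-node indices.
        h = self.gcl1(a_hat, x)               # first GCL over the whole graph
        seq = h[word_seqs]                    # (num_docs, seq_len, hidden_dim)
        enc = self.encoder(seq).mean(dim=1)   # contextual + sequential document feature
        h = h.clone()
        h[doc_idx] = (h[doc_idx] + enc) / 2   # fuse with document node features
        return self.gcl2(a_hat, h)            # second GCL yields class scores
```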

3.2 Data Preprocessing

Our data pre-processing methodology closely follows the approach outlined in [10], with a modification in the structure to enable the acquisition of POS-related information by words.
To begin, we segment the words within each document. Subsequently, we employ the Natural Language Toolkit (NLTK) [24] for POS tagging of the words. Upon analyzing each document, we establish connections between words sharing the same POS tag. All connections between words of identical POS nature are assigned equal significance, resulting in an edge weight of 1. A detailed illustration of this processing procedure is presented in Fig. 2.

Fig. 2 Building graph based on POS. As shown in this figure, words of the same POS nature are linked by edges, allowing the words to obtain a POS-based semantic representation
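As an illustration of this step, the snippet below tags a single document with NLTK and links every pair of words that shares a POS tag with an edge of weight 1. The dictionary-of-edges container and the tag granularity are assumptions of this sketch, not details taken from the paper.

```python
from itertools import combinations
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_edges(document):
    """Return {(word_i, word_j): 1} for word pairs sharing a POS tag in one document."""
    tagged = nltk.pos_tag(nltk.word_tokenize(document))   # e.g. [('language', 'NN'), ...]
    by_tag = {}
    for word, tag in tagged:
        by_tag.setdefault(tag, set()).add(word)
    edges = {}
    for words in by_tag.values():
        for w1, w2 in combinations(sorted(words), 2):
            edges[(w1, w2)] = 1                            # identical significance: weight 1
    return edges

print(pos_edges("Natural language processing is an art"))
```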
Subsequently, we establish word relationships grounded in context. Each document is scanned using a window of length 20, and we capture the frequency of occurrences for individual words within this window. Additionally, we tally the frequency of adjacent word pairs appearing within the same window. The detailed processing procedure is illustrated in Fig. 3.

Fig. 3 Building graph based on context. Compute the relationships between words in the sliding window, so that the word nodes have context-based semantic representations

After the processing illustrated in Fig. 3, we have successfully derived the word-to-word relationships. Subsequently, we proceed to establish word-to-word edges based on contextual information. The assignment of weights to these word-to-word edges is determined following Eqs. (1), (2), and (3):

p(i) = N_i / N_w    (1)
p(i, j) = N_{ij} / N_w    (2)
PMI(i, j) = log( p(i, j) / (p(i) p(j)) )    (3)

In Eqs. (1), (2), and (3), N_w represents the total number of sliding windows, N_i corresponds to the frequency of occurrence of term i across all sliding windows, and N_{ij} indicates the co-occurrence frequency of terms i and j within the same sliding windows. Pointwise Mutual Information (PMI) [25] is employed to quantify the relationship between two terms, with higher PMI values indicating a stronger association between them. Therefore, a PMI greater than 0 signifies a substantial correlation between the two words, leading to the establishment of edges with assigned weights.
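A minimal sketch of the PMI weighting of Eqs. (1)-(3), using the sliding window of length 20 described above; stepping the window one token at a time is our assumption.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(tokenized_docs, window_size=20):
    """Word-word edge weights: PMI(i, j) computed over sliding windows."""
    windows = []
    for tokens in tokenized_docs:
        if len(tokens) <= window_size:
            windows.append(tokens)
        else:
            windows.extend(tokens[k:k + window_size]
                           for k in range(len(tokens) - window_size + 1))
    n_w = len(windows)                      # N_w: total number of sliding windows
    word_count = Counter()                  # N_i: windows containing word i
    pair_count = Counter()                  # N_ij: windows containing both i and j
    for window in windows:
        vocab = set(window)
        word_count.update(vocab)
        pair_count.update(combinations(sorted(vocab), 2))
    edges = {}
    for (i, j), n_ij in pair_count.items():
        pmi = math.log((n_ij / n_w) / ((word_count[i] / n_w) * (word_count[j] / n_w)))
        if pmi > 0:                         # only positively associated pairs become edges
            edges[(i, j)] = pmi
    return edges
```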
Next, we establish connections between documents and words, treating each document as a node within the graph. Document nodes form connections with the words present in the respective documents. The weights assigned to these connections between documents and words are determined using Term Frequency-Inverse Document Frequency (TF-IDF) [26], with the corresponding formulas presented in Eqs. (4), (5), and (6):

TF = M_i / M_d    (4)
IDF = log( M_D / M_id )    (5)
TF-IDF = TF × IDF    (6)

In the equations provided above, M_i represents the frequency of occurrence of term i within the current document, M_d signifies the total word count in the current document, M_D corresponds to the total number of documents, and M_id stands for the count of documents containing the term i. TF-IDF serves as a fundamental metric to assess word significance in the context of document classification, with higher TF-IDF values indicating greater word importance within documents. Following these initial steps, we constructed the word-document graph, and the comprehensive graph structure is visually depicted in Fig. 4.

Fig. 4 Word-document graph based on POS and context. Using consistent colors to represent identical parts of speech, with D symbolizing the document node
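The document-word weights of Eqs. (4)-(6) can be sketched in the same spirit; raw term frequency and the natural logarithm are assumed, matching the notation above.

```python
import math
from collections import Counter

def tfidf_edges(tokenized_docs):
    """Document-word edge weights following Eqs. (4)-(6)."""
    n_docs = len(tokenized_docs)                       # M_D: total number of documents
    doc_freq = Counter()                               # M_id: documents containing term i
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    edges = {}
    for d, tokens in enumerate(tokenized_docs):
        counts = Counter(tokens)                       # M_i per term in document d
        for word, m_i in counts.items():
            tf = m_i / len(tokens)                     # Eq. (4)
            idf = math.log(n_docs / doc_freq[word])    # Eq. (5)
            edges[(d, word)] = tf * idf                # Eq. (6)
    return edges
```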
At this point, our data pre-processing is complete, and the next step is to process the word-document graph via the GTG network, as detailed in the next section.

3.3 GTG

We integrated the Transformer between the GCN layers with the aim of not only extracting deeper contextual semantics from the word nodes but also capturing the semantic ordering information within the text. The specifics of this approach are illustrated in Fig. 5.

Fig. 5 The GTG structure

In Fig. 5, the output of the Transformer is fused with the document node vector, as shown in Eq. (7):

Out_doc = (Out_Transformer + Out_1st-doc) / 2    (7)

In the above equation, we fuse the output of the Transformer with the document vector of the first graph convolution layer in a summation-averaging manner. We use the smoother Mish [27] to make Out_Transformer and Out_1st-doc blend better. The Mish function is defined in Eq. (8):

Mish(x) = x * tanh( ln(1 + e^x) )    (8)

In Eq. (8), x signifies the input features, tanh denotes the hyperbolic tangent function, and ln stands for the natural logarithm. Subsequently, we substitute the initial document features with Out_doc, followed by further convolving the refined graph using the second graph convolutional layer to attain the classification outcome.
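In code, the fusion of Eqs. (7) and (8) amounts to averaging the two document representations and applying Mish; the PyTorch sketch below assumes 300-dimensional features and applies Mish to the averaged result, since the exact point at which the activation is applied is not spelled out above.

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Eq. (8): Mish(x) = x * tanh(ln(1 + e^x)), i.e. x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

def fuse_document_features(out_transformer, out_first_gcl_doc):
    """Eq. (7): average the Transformer output with the first-GCL document vectors."""
    out_doc = (out_transformer + out_first_gcl_doc) / 2
    return mish(out_doc)

# Toy example: 4 documents with 300-dimensional features (dimension assumed).
doc_from_gcl = torch.randn(4, 300)
doc_from_transformer = torch.randn(4, 300)
print(fuse_document_features(doc_from_transformer, doc_from_gcl).shape)  # torch.Size([4, 300])
```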
3.3.1 Transformer Encoder

Transformer is a powerful feature selection tool that incorporates an attention mechanism to give words context-based attention scores [28]. The Transformer is based on an encoder-decoder structure; in this paper, we use only the encoder of the Transformer. The Transformer encoder structure is shown in Fig. 6.

Fig. 6 The Transformer encoder. It includes position embedding, multi-headed attention, residual connectivity, normalization, and feed-forward networks

In Fig. 6, the inclusion of position embedding introduces positional information to individual tokens, thus enabling the Transformer to consider the sequential order of tokens during training. The implementation of position embedding in the Transformer relies on trigonometric functions, as illustrated in Eqs. (9) and (10). The periodic nature of trigonometric functions effectively captures the relative positions of words within a textual sequence. Additionally, the application of trigonometric formulas allows for the efficient calculation of positional information in a concise manner. Due to their representation as high-dimensional vectors, trigonometric functions align well with matrix multiplication operations in both the Transformer and GCN, enhancing overall efficiency:

PE(pos, 2i) = sin( pos / 1000^{2i / d_model} )    (9)
PE(pos, 2i + 1) = cos( pos / 1000^{2i / d_model} )    (10)

In the equations provided above, pos represents the index value of the word's position within the original document, d_model stands for the model's dimensionality, and i corresponds to the positional embedding index. When two words exhibit strong trigonometric similarity, they are regarded as being in proximity within the sentence. Figure 7 visually depicts the fusion of the phrase "Natural language processing is an art" with positional embedding information.

Fig. 7 An example of a word node with position embedding information

In Fig. 7, "pe" denotes the positional embedding information. Positional embedding is a vector that aligns with the dimensionality of the word nodes, obtained from Eqs. (9) and (10). Figure 8 illustrates the attention heatmap of the phrase "Natural language processing is an art" with the inclusion of positional embedding.

Fig. 8 The attention heat map of "Natural language processing is an art". The coordinate axes represent the word nodes, and the depth of the matrix square color is positively correlated with the attention score of the word nodes
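A direct NumPy transcription of Eqs. (9) and (10); d_model = 300 is chosen only to match the node dimension used later, and the base of 1000 follows the equations as printed (the original Transformer formulation uses 10000).

```python
import numpy as np

def position_embedding(seq_len, d_model=300, base=1000.0):
    """Sinusoidal position embedding of Eqs. (9) and (10)."""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]            # word positions within the document
    two_i = np.arange(0, d_model, 2)             # even embedding indices, i.e. 2i
    angle = pos / np.power(base, two_i / d_model)
    pe[:, 0::2] = np.sin(angle)                  # Eq. (9)
    pe[:, 1::2] = np.cos(angle)                  # Eq. (10)
    return pe

pe = position_embedding(seq_len=6)   # e.g. "Natural language processing is an art"
print(pe.shape)                      # (6, 300)
```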
In Fig. 8, it is evident that word nodes in close proximity acquire higher attention scores following multiplication. This observation highlights that the incorporation of position embedding imbues the word nodes with valuable positional information. In multi-headed attention, the Transformer's input is mapped into several Scaled Dot-Product Attention networks. The equation of Scaled Dot-Product Attention is Eq. (11):

Attention(Q, K, V) = Softmax( Q K^T / sqrt(d_k) ) V    (11)

In Eq. (11), Softmax is the normalization function, Q represents the query vector, K is the queried vector, V is the content vector, and d_k is the vector dimension. Here, Q, K, and V are text sequence vectors composed of word nodes that are multiplied by different parameter matrices. Therefore, what is being calculated here is the self-attention between words in the same context. Subsequent to the attention calculation, the output from each head is amalgamated to form the output of the multi-head attention. This multi-headed attention mechanism captures attention distributions from multiple perspectives, yielding superior outcomes compared to the singular attention approach. In this research, we employ a single layer of the Transformer encoder. We concatenate the output tokens from each Transformer and standardize their dimensions before aligning them with the document node through a linear layer. The precise formulations are illustrated in Eqs. (12) and (13):

Out_Transformer = Concat(token_1, ..., token_n)    (12)
Out_Transformer = Linear(Out_Transformer)    (13)

In the above formulas, Concat is the concatenation function and token is the word vector.
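The attention of Eq. (11) and the concatenation plus projection of Eqs. (12) and (13) can be sketched as follows; the single-head form, the toy shapes, and the 300-dimensional output are assumptions of the sketch.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """Eq. (11): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

# Toy example: one document of 6 word nodes with 300-dimensional features.
tokens = torch.randn(1, 6, 300)
w_q, w_k, w_v = (nn.Linear(300, 300, bias=False) for _ in range(3))
attended = scaled_dot_product_attention(w_q(tokens), w_k(tokens), w_v(tokens))

# Eqs. (12)-(13): concatenate the output tokens and project them with a linear layer
# so that the result can be aligned with the document node dimension.
concat = attended.reshape(1, -1)                  # Concat(token_1, ..., token_n)
out_transformer = nn.Linear(6 * 300, 300)(concat)
print(out_transformer.shape)                      # torch.Size([1, 300])
```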
3.3.2 Graph Convolutional Network

GCN can be seamlessly employed to analyze graph data structures, effectively capturing spatial relationships among nodes and facilitating node classification [29]. Over the past years, the potential of GCN in text classification has garnered increasing attention from researchers, leading to its growing adoption in various text classification tasks.
To facilitate efficient computations, GCN employs matrix multiplication for all its operations, thereby representing and processing graph structures as adjacency matrices. In the initial graph convolutional layer (GCL), node updates are determined by Eqs. (14) and (15):

Â = D^{-1/2} A D^{-1/2}    (14)
L^(1) = ρ( Â X W_o )    (15)

where A is the adjacency matrix of the graph, ρ is the ReLU function, D is the degree matrix of A, X is the node feature matrix, and W_o is the weight matrix.
After the initial GCL update, each node effectively assimilates information from its neighboring nodes, resulting in nodes possessing specific spatial characteristics and exhibiting a clustering effect. Subsequently, the textual nodes from the initial layer are inputted into the Transformer to additionally extract contextual and sequential textual information. The ensuing step involves dimensionality reduction through the second GCL, yielding the ultimate classification outcomes, as illustrated in Eq. (16):

L^(2) = ρ( Â L^(1) W_o )    (16)

In Eq. (16), the feature input is refined as L^(1). Subsequently, an additional convolution operation is applied to L^(1) to extract more intricate word-document spatial information. This refinement aims to amplify the clustering impact of the document nodes, ultimately contributing to the accomplishment of the classification task. Consistent with the approach proposed in [10], we maintain a node dimension of 300 in this study.
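A NumPy sketch of the two-layer propagation of Eqs. (14)-(16) on a toy graph, with ReLU as ρ; the random features, random weights, and the assumed eight output classes are placeholders, and the Transformer stage between the two layers is omitted here.

```python
import numpy as np

def normalize_adjacency(a):
    """Eq. (14): A_hat = D^(-1/2) A D^(-1/2), where D is the degree matrix of A."""
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x):
    return np.maximum(x, 0.0)

# Toy graph: 5 nodes (words and documents) with self-loops so every degree is positive.
a = np.eye(5)
a[0, 1] = a[1, 0] = 1.0            # a word-word edge (POS or PMI based)
a[0, 4] = a[4, 0] = 1.0            # a document-word edge (TF-IDF based)
a_hat = normalize_adjacency(a)

x = np.random.randn(5, 300)        # node features; dimension 300 as stated above
w0 = np.random.randn(300, 300)     # first-layer weights
w1 = np.random.randn(300, 8)       # second-layer weights; 8 classes assumed

l1 = relu(a_hat @ x @ w0)          # Eq. (15): first graph convolution
l2 = relu(a_hat @ l1 @ w1)         # Eq. (16): second graph convolution, class scores
print(l2.shape)                    # (5, 8)
```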
4 Experimental Results

In this section, we will experimentally verify the effectiveness and superiority of the method in this paper.
4.1 Experimental Datasets

We selected R8, 20NG, MR, R52, and Ohsumed as experimental datasets, which are representative in this field. The information about the datasets is shown in Table 2.

Table 2 The datasets' information

Dataset | Docs | Training | Test | Classes | Average length
MR (a) | 10,662 | 7108 | 3554 | 2 | 20
R8 (b) | 7674 | 5485 | 2189 | 8 | 65
20NG (c) | 18,846 | 11,314 | 7532 | 20 | 221
R52 (d) | 9100 | 6532 | 2568 | 52 | 69
Ohsumed (e) | 7400 | 3357 | 4043 | 23 | 135

(a) MR is a sentiment analysis dataset of movie reviews, which contains positive and negative categories
(b) R8 is an 8-category news topic dataset
(c) 20NG is a 20-category news topic dataset
(d) R52 is a 52-category news topic dataset
(e) Ohsumed is a text classification dataset in medicine, containing 23 categories

As depicted in Table 2, these datasets encompass diverse domains including long text, short text, and sentiment analysis. These domains collectively provide a comprehensive representation of the text classification field. Given that the Transformer necessitates text inputs of fixed length, we adjust each document by either truncating or padding it based on the average length of the dataset.

4.2 Experimental Evaluation Index

Accuracy and F1 are used as experimental evaluation indicators. The calculation methods of accuracy and F1 are shown in Eqs. (17), (18), (19), and (20):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (17)
Precision = TP / (TP + FP)    (18)
Recall = TP / (TP + FN)    (19)
F1 = 2 * Precision * Recall / (Precision + Recall)    (20)

True Positives (TP) is the number of positive classes predicted correctly; False Positives (FP) is the number of negative classes predicted to be positive classes; True Negatives (TN) is the number of negative classes predicted correctly; False Negatives (FN) refers to the number of positive classes predicted to be negative [30].
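A sketch of Eqs. (17)-(20) for a single class; the macro F1 reported in Table 4 is assumed to be the average of these per-class scores.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)                   # Eq. (17)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)                               # Eq. (18)
    recall = tp / (tp + fn)                                  # Eq. (19)
    return 2 * precision * recall / (precision + recall)     # Eq. (20)

# Hypothetical confusion counts for one class
print(accuracy(tp=90, tn=80, fp=10, fn=20))   # 0.85
print(f1_score(tp=90, fp=10, fn=20))          # about 0.857
```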

4.3 Experimental Setting

Following the approach outlined in the proposal [10], we opt for a random selection of 10% of the documents from the training dataset to form the validation set. Our training process runs for up to 200 epochs with a learning rate of 0.02 and stops early if the validation loss shows no improvement for ten consecutive epochs. A dropout rate of 0.5 is applied, accompanied by the utilization of the ReLU activation function.
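This training recipe can be summarized as an early-stopping loop; apart from the stated hyperparameters (at most 200 epochs, learning rate 0.02, patience of ten epochs), everything in this sketch, including the optimizer choice and the callback functions, is an assumption.

```python
import torch

def train(model, train_loss_fn, val_loss_fn, lr=0.02, max_epochs=200, patience=10):
    """Early stopping on the validation loss, following the setting described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer is assumed
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = train_loss_fn(model)        # loss over the labelled document nodes
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = float(val_loss_fn(model))
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # stop after ten epochs without improvement
    return model
```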
4.4 Baselines

In this section, we will briefly introduce the baseline models of this paper.
Machine learning Machine learning-based methods have been widely used in the field of text classification. We choose SVM, KNN, and RF as the machine learning methods, and their text features are initialized by TF-IDF.
Deep learning We choose BiLSTM, BiGRU, CNN, Transformer, and FastText [31] as the deep learning methods. The last output of BiLSTM and BiGRU is used as the classification token, and the classification token is fed into a linear layer to get the prediction. The CNN uses the TextCNN variant with convolution kernel sizes of (2, 3, 4). In the Transformer, all the output tokens are concatenated to the same dimension and the prediction is obtained by a linear layer. The word vectors of the RNN, Transformer, and CNN methods are initialized by pre-trained GloVe [32]. In addition, FastText classification is performed by summing and averaging the word vectors obtained from training and obtaining predictions through a linear layer.
Recent related works We choose TextGCN [10], BiGRU+GCN [16], and BiLSTM+GCN [20] as the comparison methods. To compare the structural advantages and disadvantages of each method, we uniformly initialize the node features with one-hot vectors.

4.5 Results

In this section, we present the pertinent experimental findings along with a concise analysis of these results. The test accuracies of each approach are displayed in Table 3. For the document classification task in this study, we evaluated test accuracy and F1 score over ten runs of all models. The outcomes are reported as the mean value accompanied by the standard deviation. In Tables 3 and 4, the bolded results are significantly better than those of the other methods on the corresponding dataset according to a t-test.

Table 3 The test accuracy (%)


Method 20NG R8 R52 Ohsumed MR

SVM 83.54 ± 1.66 96.71 ± 0.12 93.07 ± 0.59 63.00 ± 1.26 75.44 ± 0.32
KNN 67.78 ± 1.32 88.03 ± 0.86 85.44 ± 0.52 56.32 ± 1.39 70.15 ± 0.22
RF 77.54 ± 2.56 94.88 ± 1.53 87.58 ± 1.16 58.32 ± 1.32 69.41 ± 0.25
BiLSTM 73.20 ± 0.56 96.35 ± 1.32 90.39 ± 0.69 49.56 ± 1.22 77.53 ± 0.29
BiGRU 73.61 ± 0.36 96.55 ± 1.12 91.12 ± 0.62 49.11 ± 1.19 76.95 ± 0.33
CNN 82.25 ± 0.28 95.61 ± 0.77 87.56 ± 0.86 58.64 ± 1.02 77.62 ± 0.66
Transformer 74.26 ± 0.86 96.47 ± 1.32 92.12 ± 1.12 52.31 ± 0.94 76.56 ± 0.65
FastText 79.52 ± 0.46 94.59 ± 0.88 90.86 ± 0.34 55.61 ± 0.36 76.31 ± 0.52
TextGCN [10] 86.26 ± 0.16 96.80 ± 0.13 93.61 ± 0.14 68.32 ± 0.29 76.00 ± 0.48
TextGCN(POS) 86.38 ± 0.11 97.02 ± 0.12 93.54 ± 0.16 68.47 ± 0.42 76.53 ± 0.44
BiGRU+GCN [16] 86.77 ± 0.14 97.06 ± 0.13 93.88 ± 0.18 68.44 ± 0.33 77.56 ± 0.46
BiLSTM+GCN [20] 86.55 ± 0.12 97.38 ± 0.16 94.20 ± 0.18 69.15 ± 0.36 78.24 ± 0.44
Ours 86.96 ± 0.09 97.22 ± 0.10 94.46 ± 0.08 69.72 ± 0.13 77.24 ± 0.30
The proposed method demonstrated significantly superior performance compared to the baselines on the 20NG, R52, and Ohsumed datasets, as determined by a Student's t test (p < 0.05)

Table 4 The macro F1-score on the test set (%)


Method 20NG R8 R52 Ohsumed MR

SVM 83.26 ± 1.26 89.12 ± 1.06 68.64 ± 0.05 62.66 ± 0.62 76.32 ± 0.04
KNN 66.59 ± 2.14 82.64 ± 2.26 66.32 ± 1.12 53.12 ± 1.11 70.21 ± 0.12
RF 77.14 ± 1.36 86.64 ± 1.53 65.36 ± 2.16 52.61 ± 1.06 69.26 ± 0.52
BiLSTM 73.65 ± 0.26 88.55 ± 1.41 69.36 ± 0.33 48.66 ± 0.22 77.26 ± 0.26
BiGRU 73.33 ± 0.38 88.62 ± 1.23 69.44 ± 0.62 48.99 ± 0.52 76.82 ± 0.13
CNN 82.06 ± 0.33 88.76 ± 0.63 69.55 ± 0.33 53.16 ± 0.62 77.60 ± 0.32
Transformer 74.88 ± 0.75 88.26 ± 0.86 68.88 ± 0.56 52.69 ± 0.41 75.96 ± 0.32
FastText 78.24 ± 0.36 90.64 ± 0.63 69.71 ± 0.14 54.88 ± 0.26 76.22 ± 0.46
TextGCN [10] 85.02 ± 0.06 92.88 ± 0.06 70.17 ± 0.07 61.45 ± 0.35 75.58 ± 0.34
TextGCN(POS) 85.12 ± 0.06 93.25 ± 0.12 70.42 ± 0.22 62.06 ± 0.12 76.49 ± 0.33
BiGRU+GCN [16] 85.45 ± 0.14 93.42 ± 0.26 70.66 ± 0.12 62.16 ± 0.13 77.52 ± 0.31
BiLSTM+GCN [20] 85.40 ± 0.10 94.33 ± 0.32 70.96 ± 0.05 62.32 ± 0.21 78.20 ± 0.31
Ours 85.69 ± 0.11 93.66 ± 0.47 71.22 ± 0.07 62.77 ± 0.43 77.02 ± 0.21
The proposed method demonstrated a significantly superior performance compared to the baselines on datasets including 20NG, R52, and Ohsumed,
as determined by Student’s t test ( p < 0.05)

As indicated in Table 3, the proposed approach demonstrates optimal classification performance across three datasets. Specifically, the proposed method achieves a classification accuracy of 86.96% on 20NG, 94.46% on R52, and 69.72% on Ohsumed. In comparison, the proposed method outperforms BiGRU+GCN and BiLSTM+GCN by 0.19% and 0.41% on 20NG, surpasses them by 0.58% and 0.26% on R52, and exceeds them by 1.28% and 0.57% on Ohsumed. Furthermore, as detailed in Table 4, the proposed method also attains the highest F1 scores on 20NG, R52, and Ohsumed. These results underscore the superiority of the method presented in this paper for text classification tasks. They also affirm that the Transformer exhibits more robust feature extraction capabilities than the RNN structure and achieves a more precise semantic representation of tokens.
However, on the MR and R8 datasets, our classification performance lags behind the BiLSTM+GCN approach. This discrepancy suggests that the BTM within BiLSTM+GCN is more adept at capturing crucial information from shorter texts, revealing a limitation in our method's performance on concise texts. Despite this, our method outperforms the Transformer and TextGCN across all datasets, showcasing how the GTG structure effectively amalgamates the strengths of the Transformer and GCN networks to enhance the model's feature extraction prowess.
The inclusion of POS in TextGCN(POS) results in performance enhancements across four datasets, as word nodes encapsulate both contextual and POS-related semantics. Additionally, our observations demonstrate that SVM achieves commendable classification performance, often surpassing deep learning approaches. This underscores the effectiveness of machine learning in simpler tasks.
Next, we store the output of the second GCL post-training and proceed to visualize the two-dimensional embeddings of word nodes. Employing t-SNE [33], we condense the word embeddings into two dimensions, designating the highest value within the word vector dimension as the word label, as depicted in Fig. 9. Within the t-SNE visualization, the horizontal and vertical axes signify t-SNE values utilized for gauging point-to-point distances.

Fig. 9 The embedding of word nodes from the second GCL with maximum-value labels. In the above figure, points with the same color represent the same document category

In Fig. 9, it is evident that words sharing the same label are closely clustered, aligning with the findings in the referenced proposal [10]. This indicates that words predominantly cluster around their document categories: word nodes connected to document nodes are proximate to them. The semantic attributes of words are primarily molded by their immediate context, a distinctive trait of the GCN. To illustrate the embedding visualization of word nodes, we also utilize POS as labels, as depicted in Fig. 10.

Fig. 10 The embedding of word nodes from the second GCL with POS labels. In the above figure, points with the same color represent the same POS tag

In Fig. 10, it is evident that nodes sharing the same POS are proximate within a limited span. This close proximity can be perceived as the immediate context of the word. Words within the same document naturally draw near, owing to their shared contextual surroundings. Building upon this premise, they are additionally influenced by their POS and tend to cluster around the corresponding POS within a given context. This observation underscores that the approach presented in this paper imbues the processed word nodes with both contextual and POS-related semantics.
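The visualization step can be reproduced along these lines with scikit-learn and matplotlib; the arrays below are placeholders for the stored second-GCL word embeddings and their labels (the argmax over the embedding dimensions for Fig. 9, or a POS tag id for Fig. 10).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for the stored second-GCL word embeddings.
word_embeddings = np.random.randn(500, 300)
labels = word_embeddings.argmax(axis=1) % 8        # stand-in for category or POS labels

points = TSNE(n_components=2, random_state=0).fit_transform(word_embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.savefig("word_node_embeddings.png", dpi=200)
```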
Declarations

Conflict of Interest The authors declare no competing interests.

123
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Kowsari, K., JafariMeimandi, K., Heidarysafa, M., et al.: Text classification algorithms: a survey. Information 10(4), 150 (2019)
2. Mirończuk, M.M., Protasiewicz, J.: A recent overview of the state-of-the-art elements of text classification. Expert Syst. Appl. 106, 36–54 (2018)
3. Goudjil, M., Koudil, M., Bedda, M., et al.: A novel active learning method using SVM for text classification. Int. J. Autom. Comput. 15, 290–298 (2018)
4. Trstenjak, B., Mikac, S., Donko, D.: KNN with TF-IDF based framework for text categorization. Procedia Eng. 69, 1356–1364 (2014)
5. Shah, K., Patel, H., Sanghvi, D., et al.: A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment. Hum. Res. 5, 1–16 (2020)
6. Li, Y., Yang, T.: Word embedding for understanding natural language: a survey. Guide Big Data Appl. 26, 83–104 (2018)
7. Vieira, J.P.A., Moura, R.S.: An analysis of convolutional neural networks for sentence classification. In: 2017 XLIII Latin American Computer Conference (CLEI), pp. 1–5. IEEE (2017)
8. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101 (2016)
9. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5988–5999 (2017)
10. Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classification. Proc. AAAI Conf. Artif. Intell. 33(01), 7370–7377 (2019)
11. Malekzadeh, M., Hajibabaee, P., Heidari, M.: Review of graph neural network in text classification. In: 2021 IEEE 12th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), pp. 0084–0091. IEEE (2021)
12. Huang, L., Ma, D., Li, S., et al.: Text level graph neural network for text classification. arXiv preprint arXiv:1910.02356 (2019)
13. Liu, X., You, X., Zhang, X., et al.: Tensor graph convolutional networks for text classification. Proc. AAAI Conf. Artif. Intell. 34(05), 8409–8416 (2020)
14. Xue, B., Zhu, C., Wang, X., et al.: The study on the text classification based on graph convolutional network and BiLSTM. In: Proceedings of the 8th International Conference on Computing and Artificial Intelligence, pp. 323–331. ACM (2022)
15. Fellbaum, C.: WordNet. In: Theory and Applications of Ontology: Computer Applications, pp. 231–243. Springer, Dordrecht (2010)
16. Dong, Y., Yang, Z., Cao, H.: A text classification model based on GCN and BiGRU fusion. In: Proceedings of the 8th International Conference on Computing and Artificial Intelligence, pp. 318–322. ACM (2022)
17. Church, K.W.: Word2Vec. Nat. Lang. Eng. 23(1), 155–162 (2017)
18. Fang, F., Hu, X., Shu, J., et al.: Text classification model based on multi-head self-attention mechanism and BiGRU. In: 2021 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), pp. 357–361. IEEE (2021)
19. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
20. Ye, Z., Jiang, G., Liu, Y., et al.: Document and word representations generated by graph convolutional network and BERT for short text classification. In: ECAI 2020, pp. 2275–2281. IOS Press (2020)
21. Huang, J., Peng, M., Li, P., et al.: Improving biterm topic model with word embeddings. World Wide Web 23(6), 3099–3124 (2020)
22. Lin, Y., Meng, Y., Sun, X., et al.: BertGCN: transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727 (2021)
23. Wang, K., Han, S.C., Poon, J.: InducT-GCN: inductive graph convolutional networks for text classification. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 1243–1249. IEEE (2022)
24. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media, Sebastopol (2009)
25. Bouma, G.: Normalized (pointwise) mutual information in collocation extraction. Proc. GSCL 30, 31–40 (2009)
26. Ramos, J.: Using TF-IDF to determine word relevance in document queries. Proc. First Instr. Conf. Mach. Learn. 242(1), 29–48 (2003)
27. Misra, D.: Mish: a self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019)
28. Soyalp, G., Alar, A., Ozkanli, K., et al.: Improving text classification with Transformer. In: 2021 6th International Conference on Computer Science and Engineering (UBMK), pp. 707–712. IEEE (2021)
29. Zhang, S., Tong, H., Xu, J., et al.: Graph convolutional networks: a comprehensive review. Comput. Soc. Netw. 6(1), 1–23 (2019)
30. Feng, Y., Cheng, Y.: Short text sentiment analysis based on multi-channel CNN with multi-head attention mechanism. IEEE Access 9, 19854–19863 (2021)
31. Joulin, A., Grave, E., Bojanowski, P., et al.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
32. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. ACL (2014)
33. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.