
Expert Systems With Applications 209 (2022) 118228

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

A reinforced active learning approach for optimal sampling in aspect term extraction for sentiment analysis
Manju Venugopalan, Deepa Gupta *
Dept of Computer Science and Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India

ARTICLE INFO

Keywords: Active learning; Reinforcement learning; Sequential text labelling; Aspect term extraction; Deep learning; Optimal sampling; Data annotation

ABSTRACT

Aspect level sentiment analysis is a fine grained task in sentiment analysis which identifies the product features from an opinionated piece of text and maps the sentiment towards each of them. Supervised ML algorithms have reported comparatively higher performance on aspect level sentiment analysis but at the cost of substantial qualitative labelled data. Data labelling for such fine grained tasks also demands domain knowledge and expertise. Hence a mechanism to extract a minimal informative subset which is almost representative of the entire data would be a breakthrough in bringing down the annotation costs to a large extent. The proposed methodology puts forward an active learning based sampling strategy for aspect term extraction, a subtask in aspect level sentiment analysis which identifies the product features. The sampling strategy is automated by reinforcement learning which extracts an optimal sample from the entire unlabelled training data and hence optimizes data annotation by reducing the time and effort linked to the labelling process. This work is of high importance in a data driven era where companies invest a lot in collecting and annotating huge volumes of data. The model has been experimented across the laptop and restaurant domains of SemEval (2014–2016) datasets. The experiments showed that a considerable reduction of the training data size is achieved across different datasets. The model trained on the data extracted by the proposed reinforced active learning model beats random sampling by 9 to 17 points when evaluated on the F-measure of the extracted aspect terms and is almost on par with the model trained on the entire training data by utilising hardly 9 to 13% of the entire training data across the datasets experimented.

1. Introduction

Exponential data growth and the realization of the strength of data have paved the way for a data driven era. This, in turn, has elevated the importance of many commercial applications, one such being sentiment analysis (Alamoodi et al., 2021; Medhat, Hassan, & Korashy, 2014; Basiri, Nemati, Abdar, Cambria, & Acharya, 2021; Venugopalan & Gupta, 2015). Sentiment analysis is a field that emerged from an inborn human curiosity to know what others feel or think about a product or service. It automates the process of extracting the sentiment or opinion of the reviewer from a piece of text. Sentiment analysis, an important metric for market analysis, has evolved from just imparting the overall polarity of sentiment as positive, negative or neutral to many fine grained forms. The field of sentiment analysis has been an active research area for the past two decades. The wide range of applications and the market demand for its fine grained versions have contributed to the continuously evolving nature of the field and hence it still remains a prominent research area. Aspect level sentiment analysis (Alamoudi & Alghamdi, 2021; Obiedat, Al-Darras, Alzaghoul & Harfoushi, 2021) is a fine grained version that provides a sentiment summarization in terms of the product features/aspects that are talked about in the review. One of the challenging tasks in aspect level sentiment analysis is identifying the aspects from the review text, referred to as aspect term extraction, and the proposed model revolves around this task. For example, in the review sentence “I recommend the meatballs and caprese salad and the beans on toast were a wonderful start to the meal!”, the aspect terms are meatballs, caprese salad, beans on toast and meal. Several concrete models have been experimented with for aspect term extraction, most of which are supervised (Augustyniak, Kajdanowicz, & Kazienko, 2021; Kumar et al., 2021a; Akhtar, Garg, & Ekbal, 2020; Venugopalan, Gupta, & Bhatia, 2021) in nature. This owes to the fact that in machine learning (ML), supervised models are highly preferred as they are based on ground truth, which is the awareness of the expected output values. A supervised learning algorithm learns from labelled training data where the model

* Corresponding author.
E-mail addresses: v_manju@blr.amrita.edu (M. Venugopalan), g_deepa@blr.amrita.edu (D. Gupta).

https://doi.org/10.1016/j.eswa.2022.118228
Received 23 February 2022; Received in revised form 11 May 2022; Accepted 17 July 2022
Available online 21 July 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.
M. Venugopalan and D. Gupta Expert Systems With Applications 209 (2022) 118228

devises a function that best explains the relationship between inputs and the expected outputs. Hence supervised models have always outperformed unsupervised/semi-supervised models, which is equally true for aspect term extraction.

However, supervised ML models require voluminous labelled data and the quality of the labels is also very crucial. To add to the cost and time factors, domain expertise is also required for labelling in such fine-grained tasks. Data annotation is thus a cumbersome and time consuming task for many such fine grained tasks in text labelling applications like POS tagging (Xiaolong, 2006; Cing & Soe, 2020), NER (Wang & Fan, 2009; Li, Sun, & Ma, 2020), semantic labelling in medical text (Zhang et al., 2020; Nair, Gupta, Devi, & Bhat, 2019), text labelling in historical documents (Kim, Lee, Kim & Jeong, 2020) etc. The wide range of applications and the strength of supervised models have triggered an escalating demand for labelled data that has caused the data annotation industry to evolve into a billion dollar business.¹

¹ https://medium.com/syncedreview/data-annotation-the-billion-dollar-business-behind-ai-breakthroughs-d929b0a50d23.

A considerable reduction in the volume of training data to be annotated would directly map to a substantial cost reduction in model implementation. Efforts towards reducing annotation load in supervised models for aspect level sentiment analysis, a prominent research area, have been hardly explored. For the above mentioned reasons, any effective approach that would bring in a significant reduction in the volume of training data required for aspect term extraction would be highly appreciated. One way to achieve this goal is to devise a sampling technique that selects an optimal sample automatically from the unlabelled data. An automated strategy that effectively samples a minimal and most informative subset from the unlabelled data would be a breakthrough, provided there is a constructive strategy to identify the effectiveness of each instance in the resulting sample. The whole idea is grounded on the fact that each instance in the training data need not be uniquely informative and hence there could be a subset of the training data almost as informative as the entire data.

Active Learning (Angluin, 2001; Balcan, Hanneke, & Vaughan, 2010; Ren et al., 2021; Hemmer, Kuhl, & Schoffer, 2022) emerged as a solution in this context by extracting a highly impactful subset from the entire dataset which is almost representative of the original dataset. Often referred to as the ML model with “human in the loop”, active learning involves a human who manually labels data between different iterations of model training. Active learning has been experimented in varied NLP applications like text categorization (Hu, Namee, & Delany, 2016), POS tagging (Chaudhary, Anastasopoulos, Sheikh, & Neubig, 2021), biomedical text mining (Zhang, Huang, & Zhu, 2012), clinical annotation (Wei et al., 2019), sentiment analysis (Smailovic, Grcar, Lavrac, & Znidarsic, 2014) etc. The experimentations as seen in literature have been majorly on multiple approaches to prioritise the qualitative instances for labelling (Li & Sethi, 2006; Zhu, Wang, Tsou & Ma, 2009; Settles, Craven & Ray, 2007; Settles & Craven, 2008) and how to iterate over the approach (Zhu, Wang, & Hovy, 2008; Zhu, Wang, Hovy, & Ma, 2010). Hybrid approaches (Lughofer, 2012; Hantke, Zhang & Schuller, 2017) that combine many active learning strategies have also been experimented. There has been considerably less work in active learning for sentiment analysis, and most of it was explored for sentiment classification at sentence level (Koncz & Paralic, 2013; Wang, Wan, & Zhang, 2019; Zhang et al., 2014). Even though annotation for the fine grained task of aspect level sentiment analysis is cumbersome, active learning models have been hardly explored at the aspect level.

Most of the traditional active learning approaches experimented have been criticized for the human intervention required in an iterative manner. They also suffer from bias towards the classifier. A possible solution is to automate the active learning process using Reinforcement Learning (RL) (Sutton & Barto, 1998; Baxter, Tridgell, & Weaver 2001; Lykouris, Simchowitz, Slivkins & Sun, 2021; Zhang, Li, Wang, Cambria, & Li, 2021). RL is a process in which a series of actions is implemented which corresponds to a maximum reward. RL models have been experimented in real time applications varying from automated driving systems (Kiran et al., 2021), gaming (Li et al., 2021a), real time bidding (Xiao, Wang, Shahidehpour, Li, & Yan, 2020) to NLP applications like machine translation (Zou, Huang, Xie, Dai & Chen, 2019) and text summarization (Keneshloo, Ramakrishnan & Reddy, 2019). An RL environment is capable of replacing the “human in the loop” requirement by extracting the optimal subset in one shot, which is very beneficial in real world scenarios. An RL environment automates the active learning process and also aids in error reduction, eliminating bias and catalyzing decision making.

The proposed approach presents an unsupervised active learning strategy automated by reinforcement learning for optimal sampling in aspect term extraction. This proposed reinforced active learning approach thus aids supervised models for aspect extraction in minimising the cost, effort and time devoted to creating qualitative annotated datasets. It is an unsupervised active learning strategy appropriate in the abundance of unlabelled data. It is a pool based multi-instance active learning approach to aid optimal sampling for aspect term extraction.

The contributions of the work can be summarised as

• It proposes an actor-critic based RL strategy to automate active learning which uses only minimal seed instances to extract an optimal sample (minimal and most informative) from the unlabelled instances.
• An unsupervised guided latent Dirichlet allocation strategy has been proposed to select the minimal seed instances which, when used to train a sequence labelling classifier on aspect term extraction, will serve as a starting point for the learning agent of the RL environment.
• The reward function parameters of the RL environment are designed using measures of BERT word embeddings based semantic similarity and parse tree based similarity to capture both syntactic and semantic features of the review instances.
• An appropriate optimal sample size is also learned automatically in the course of agent training, unlike other active learning approaches which incorporate a separate stopping criterion.
• The proposed reinforced active learning model is compared with complete train and random sampling models based on the same classifier. A fine-grained evaluation and analysis of the proposed reinforced active learning model in terms of influential attributes like aspect term length, aspect term label consistency and out of vocabulary words has been performed.

The rest of the paper is organized as follows. Section 2 discusses the prominent models that have been experimented so far in active learning. The proposed methodology is explained in detail in Section 3. The experimental setup in Section 4 showcases the datasets on which experiments have been performed and the parameter settings. A comprehensive presentation and analysis of the results obtained is carried out in Section 5. Finally, the conclusion of the work and insights into the future directions are presented in Section 6.

2. Related works

The fundamental concept behind active learning is that an ML algorithm can accomplish better performance even with a reduced training size, provided there is a mechanism to select the data from which it learns. Active learning is well motivated not only for aspect term extraction but for many real world applications, where unlabelled data is plentiful but labelling is expensive, challenging or laborious. The survey presents an insight into the various techniques experimented in active learning in two directions: a) the explorations in the field of active learning for data labelling applications in general and b) active learning frameworks for sentiment analysis.
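The selection mechanism referred to above is commonly realised as a pool-based loop: train on a few seed labels, query the instances the current model is least sure about, label them, and retrain. The sketch below is only an illustration of that generic loop and not the method proposed in this paper; the 1-D nearest-centroid classifier, the `oracle` labelling function and the toy pool are all hypothetical stand-ins.

```python
def train_centroids(labelled):
    """Fit a 1-D nearest-centroid classifier: one mean value per class."""
    means = {}
    for label in {y for _, y in labelled}:
        values = [x for x, y in labelled if y == label]
        means[label] = sum(values) / len(values)
    return means


def margin(model, x):
    """Gap between the two closest centroids; a small gap means uncertain.
    Assumes the model already knows at least two classes."""
    dists = sorted(abs(x - m) for m in model.values())
    return dists[1] - dists[0]


def active_learning(pool, seed, oracle, batch_size, rounds):
    """Generic pool-based loop with margin-based uncertainty sampling."""
    labelled = [(x, oracle(x)) for x in seed]      # seed must cover both classes
    pool = [x for x in pool if x not in seed]      # work on a copy of the pool
    for _ in range(rounds):
        model = train_centroids(labelled)
        pool.sort(key=lambda x: margin(model, x))  # most uncertain first
        queried, pool = pool[:batch_size], pool[batch_size:]
        labelled += [(x, oracle(x)) for x in queried]  # the "human in the loop"
    return labelled, train_centroids(labelled)


# Toy usage: points on [0, 10], true class boundary at 5.0 (hypothetical data).
toy_pool = [i * 0.5 for i in range(21)]
toy_labelled, toy_model = active_learning(
    toy_pool, seed=[0.0, 10.0], oracle=lambda x: int(x >= 5.0),
    batch_size=2, rounds=3)
print(sorted(x for x, _ in toy_labelled))
```

In this toy run only eight of the 21 points are ever labelled, and the queried ones cluster around the class boundary, which is the behaviour the query strategies surveyed below aim for.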


2.1. Active learning explorations to aid data labelling

A prototypical algorithm (Cohn, Atlas & Ladner, 1994) for active learning involved training a classifier on seed samples, then using the trained classifier to extract the most informative n instances from an unlabelled dataset, retraining the classifier using those instances and iteratively repeating this procedure until the desired classifier performance is attained or until a stopping criterion is met. The various experimentations that followed were towards determining the query strategy, i.e. how to choose the most informative sample set, and decision making regarding the stopping criterion. Based on the problem setting, three scenarios are defined for active learning: membership query synthesis, stream based selective sampling and pool based sampling. Membership query synthesis (Hoi, Jin, & Liu, 2006; Wang, Hu, Yuan & Lu, 2015; Guo, Pang, Bai, Xie, & Chen, 2021) generates new instances instead of using existing ones, which is less suitable for real world applications. Stream based sampling (King et al., 2004; Smailovic et al., 2014; Bouguelia, Belaid, & Belaid, 2016) simulates a situation where only one unlabelled instance can be drawn at a time from the data source, and a pool based (Yu, 2005; Kiani, Camp, Pezeshk, & Khoshnevis, 2020; Liu et al., 2021) scenario assumes the availability of a small labelled dataset and a large unlabelled dataset.

Following the path of Cohn et al. (1994), Lewis and Catlett derived a query strategy (Lewis & Catlett, 1994) termed confidence-based active learning or uncertainty sampling, where the idea was to identify and annotate the most uncertain samples. The probability scores of classifiers are used to assign an uncertainty value for each input instance. The experimentations showed that it performed better than the state of the art. The method was adaptable for a wide range of classifiers. The uncertainty sampling strategy (Li & Sethi, 2006; Joshi, Porikli, & Papanikolopoulos, 2009; Beluch, Genewein, Nurnberger, & Kohler, 2018; Lin, Mausam & Weld, 2016) is often criticized for ignoring the unlabeled space away from the decision boundaries and for often being misguided by outliers. As an improvement, a density based criterion (Zhu et al., 2009; Wu & Xiao, 2019) was proposed which augments the uncertainty parameter, based on the assumption that an unlabelled instance with a high density degree would have a low probability of being an outlier. Another popular active learning framework, referred to as impact sampling, attempts to query the instance that would impart the maximum change to the model. Query by Committee (Gilad, Navot, Tishby, 2005; Beygelzimer, Dasgupta, & Langford, 2009; Vandoni, Aldea & Le Hegarat-Mascle, 2019; Pan, Wei, Zhao, Ma, & Wang, 2020) has been another query strategy experimented, where different models are trained and tested on common sets. The instances for which the largest disagreement among classifiers is reported are drawn into the sample. It is advantageous in that it is not biased to the findings of a particular classifier and it also reduces the search space by querying only in controversial regions.

Most of these query strategies attempt exploitation, but strategic exploration is also required. Hybrid query strategies (Shui, Zhou, Gagne, & Wang, 2020; Ash, Zhang, Krishnamurthy, Langford & Agarwal, 2020; Wu, Chen, Zhong & Wang, 2021a) which combine multiple sampling strategies have contributed towards balancing exploration and exploitation of the input data space. Edwin Lughofer proposed a novel active learning strategy (Lughofer, 2012) for data driven classifiers that follows an unsupervised guideline during offline training and a supervised certainty based criterion for online training, and is hence projected as a hybrid model. Another interesting hybrid approach tries to combine the benefits of crowd sourcing and ML algorithms for audio processing to annotate audio segments (Hantke et al., 2017). The experimentations of these varied query strategies were in the fields of image classification, text classification/categorization, medical image segmentation, word sense disambiguation etc.

Another challenging aspect in Active Learning is the stopping criterion, for which Zhu and team in 2008 (Zhu et al., 2008) considered the capability of each unlabelled instance to alter the decision boundary as a parameter. The strategy is to terminate the active learning process when no classification change happens to the remaining unlabelled examples during two consecutive cycles. They extended the work (Zhu et al., 2010) by introducing four different parameters: maximum and overall uncertainty, minimum expected error and the chosen accuracy. The model strength was ascertained on seven different real world NLP tasks. Vlachos proposed an unsupervised strategy (Vlachos, 2008) towards the stopping criterion for uncertainty-based sampling, where the classifier confidence is estimated iteratively and a drop in the classifier confidence points towards an appropriate termination. In an attempt to design a stopping criterion that is independent of task or loss function, an approach (Ishibashi & Hino, 2020) based on the difference in the expected generalization errors and hypothesis testing was experimented. The effectiveness of the approach was verified across multiple datasets. The common objective of all the stopping criteria experimented was to terminate the active learning process when there are no more informative instances in the pool.

Deep Learning frameworks have also been minimally explored in the field of AL. A framework (Pal et al., 2020) for deep neural networks that makes use of active learning techniques to perform sample extraction has been proposed which doesn’t demand strong domain knowledge from the annotator. Their experiments on diverse datasets from text and image domains showed that even with a sample size of 10–30% of the actual size, the resulting model exhibits better performance. Fang and team (Fang, Li, & Cohn, 2017) have experimented an active learning strategy under a Markov decision framework for a stream based scenario. The algorithm has been designed as a policy based on deep RL for the named entity recognition task. The proposers have reported improvements over traditional methods like uncertainty sampling. A considerable improvement beyond conventional methods and applicability across tasks has been reported by deep models (Shen, Yun, Lipton, Kronrod, & Anandkumar, 2018; Li, Fan, Zhang, Ding, & Yin, 2021b; Wu, Chen, Zhong, Wang, & Shi, 2021b) throwing light in that direction.

2.2. Active learning frameworks for sentiment analysis

Active learning approaches in the field of sentiment analysis have been minimal and mostly applied for sentiment classification at sentence level (Koncz & Paralic, 2013; Wang et al., 2019; Zhang et al., 2014; Lakshmi Devi, Subathra, & Kumar, 2016). These works differ only in the query strategies adopted, the classifier implemented and the datasets on which they were experimented, and reported considerable reduction in training data requirement consequent to active learning implementation. An attempt (Hajmohammadi, Ibrahim, Selamat, & Fujita, 2015) to improve sentiment classification in a low resource language, by machine translating it into a high resource language and then choosing instances for labelling using uncertainty sampling, was a notable work that reported improvement in sentiment classification accuracy. Park et al. implemented an active learning strategy (Park, Lee & Moon, 2015) to aid building a domain specific sentiment lexicon that in turn improves the accuracy of sentiment classification. An active learning strategy specifically fine-tuned for aspect level sentiment analysis (Smatana, Koncz, Smatana & Paralic, 2013) was experimented on hotel reviews where the query strategy was a combination of two informativeness co-efficients based on the confidence value of the naive Bayes classifiers designed for aspect detection and sentiment classification respectively. Another interesting model fine-tuned for aspect detection (Hadano, Shimada and Endo, 2011) involved a clustering technique using the Bayon tool to cluster unlabeled samples and choose instances which are close to the centroid for labelling. A recent active learning framework (Shim, Lowet, Luca, & Vanrumste, 2021) fine-tuned for aspect category detection and sentiment classification claims its novelty by a task specific pre-training from unlabeled samples called masked language modelling. The model also addresses the cold start issue in active learning by duplicating the labelled dataset by replacing aspect categories with similar words.

Table 1 presents a quick overview of the surveyed works on active


Table 1
Active Learning works in the field of Sentiment Analysis.

| Model | Domain | Evaluation Task | Query Strategy | Classifier | Performance enhancement |
|---|---|---|---|---|---|
| Koncz et al. (2013) | Movie reviews | Sentence level sentiment classification | Uncertainty sampling, Dictionary Based | SVM | Obtained 96% of results on the whole corpora |
| Zhang et al. (2014) | Amazon reviews (Digital Camera, TV) and Facebook comments | Sentence level sentiment classification | Uncertainty sampling | CRF | Reduction in training data achieved (no quantitative measure reported) |
| Lakshmi Devi et al. (2016) | Movie reviews | Sentence level sentiment classification | Entropy Sampling in Uncertainty Sampling, Kullback-Leibler divergence and Vote Entropy in Query by Committee | Naive Bayes, SVM | Both sampling strategies performed equally good except for uncertainty sampling reporting a better accumulative iteration time |
| Wang et al. (2019) | Hotel reviews | Sentence level sentiment classification | Vote Entropy in Query by committee | Multiple SVM classifiers | More than 50% reduction in training size with a 2% gain in sentiment classification accuracy |
| Hajmohammadi et al. (2015) | Book reviews | Cross lingual sentiment classification | Uncertainty sampling, density based criterion | SVM | Selection of unlabelled data from the target language using active learning improved sentiment classification accuracies |
| Park et al. (2015) | Multi-domain dataset from Amazon | Domain specific sentiment lexicon building | Lexicon coverage analysis algorithm | SVM | Active learning contributes to building an enhanced domain-specific sentiment lexicon which results in higher sentiment classification accuracy. |
| Hadano et al. (2011) | Multi-domain | Aspect identification | A clustering approach where instances closer to the centroid of each cluster are chosen for labelling | SVM | Achieved almost 91% results of full supervision utilizing only 12% of training data |
| Smatana et al. (2013) | Hotel Reviews | Aspect level sentiment analysis | A co-efficient that combines the confidence of the classifiers for aspect identification and sentiment classification | Naive Bayes | Results on par to full supervision achieved for aspect sentiment classification with 70% of the training data. Data reduction achieved for aspect detection was minimal |
| Shim et al. (2021) | Health domain, Restaurant reviews | Aspect category detection and sentiment classification | Uncertainty sampling | Pre-trained deep bidirectional transformer | 2–3 times reduction in annotation load |
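Several of the query strategies named in Table 1 reduce to a score computed from classifier outputs, with the highest-scoring instances sent for labelling. The sketch below shows textbook forms of least-confidence and entropy scores (uncertainty sampling) and vote entropy (query by committee). It is a minimal illustration under the assumption that class probabilities and committee votes are already available; the function names are chosen here and do not come from any of the surveyed works.

```python
import math
from collections import Counter


def least_confidence(probs):
    """Uncertainty as one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy of the predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Disagreement among committee members, given one predicted label each."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(votes).values())


def select_top(scores, n):
    """Return the ids of the n instances with the highest score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [instance_id for instance_id, _ in ranked[:n]]


# Toy usage with made-up probabilities for three unlabelled reviews.
predicted = {"r1": [0.95, 0.05], "r2": [0.55, 0.45], "r3": [0.70, 0.30]}
scores = {rid: entropy(p) for rid, p in predicted.items()}
print(select_top(scores, 2))
```

A density based criterion, as described in Section 2.1, would multiply such a score by a density weight per instance; that weighting is omitted here.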

learning in the field of sentiment analysis in terms of the task explored, domain experimented, query strategy adopted and so on, and the works have been grouped based on the evaluation task.

Most of the experimental results in the literature suggest that active learning actually works, with rare exceptions (Schein & Ungar, 2007; Guo & Schuurmans, 2008) as reported by the authors. The various classic approaches or a combination of them were considered by researchers for decision making regarding the informativeness of each instance selected or the stopping criterion. Uncertainty sampling, density based, impact sampling, query by committee and hybrid methods have been explored widely as query strategies that aimed to optimize the exploration versus exploitation strategy. The active learning models experimented in the field of sentiment analysis have been minimal and have been hardly explored at the aspect level. The majority of the models implemented had the requirement of the “human/labeller in a loop” because they involve incrementally choosing instances for data labelling during the training phase, which permits the algorithm to identify the most informative instances that would accelerate learning. But in a realistic environment the annotators would prefer to perform the annotation of the selected pool in one shot, especially in models with higher training times. Another concern with the traditional active learning models is the issue of reusability. There is no guarantee that the informative instances chosen by the base classifier inbuilt in an active learner can be reused by a different classifier. Considering the implementation overheads of active learning models, the question about active learning has moved on from “Can a machine learn with fewer instances if it asks questions?” to “Would a machine learn more economically if it asks questions?”² Although there has been active research in the area of AL, the problem of expanding to high dimensional data also remains a challenge. Active learning models are also vulnerable to biases, because they are at risk of being misguided by initial beliefs derived from the patterns recognized by a model trained on a small dataset. A possible solution to the existing bottlenecks is to enhance active learning by automation using reinforcement learning, a behavioural learning model where the algorithm provides data analysis feedback. The model has been hardly explored to find a subset large enough to contain useful information and compact enough to learn a good policy suitable for commercial applications.

² https://burrsettles.com/pub/settles.activelearning.pdf.

The proposed approach is a framework for optimal sampling to aid data annotation for aspect term extraction based on active learning automated using reinforcement learning. Here the decision making of choosing or discarding an instance is not biased towards a particular learning model but is rather governed by a reward function which is defined by parameters to carefully identify the informative instances. A pool based multi-instance active learning (Carbonneau, Cheplygina, Granger, & Gagnon, 2018) scenario is designed for the problem as the decision regarding the informativeness of a sample is based on the evaluation of the entire sample. An appropriate sample subset size is also learned automatically in the course of agent training rather than incorporating a stopping criterion. Thus the bottlenecks of continued human intervention and model bias are resolved. Our initial experimentations in the field of automatic labelling in medical discharge summaries (Tandra, Nautiyal, & Gupta, 2020) and aspect term extraction in sentiment analysis (Shyam Sundar & Gupta, 2020) have inspired us to take the work forward.

3. Proposed methodology

The proposed model aims to minimize the labelled training data requirement for sequence labelling models for aspect term extraction and hence proposes an active learning strategy coupled with


reinforcement learning (RL) which performs an unsupervised learning and a sample selection process. When the reinforcement learning environment does not have any prior knowledge about the application task and the agent training is unsupervised, it may be difficult for RL to converge on difficult problems. Hence a base model is trained on a minimal seed set for sequential aspect labelling and the final state of this model serves as a starting point for the agent. The proposed architecture for reinforced active learning for sequential aspect term labelling is depicted in Fig. 1 where the model is sequenced through the following phases.

• Pre-Processing and Data Segregation
• Base Model Generation for Sequential Aspect Labelling
• Sample Selection using Reinforced Active Learning
• Reinforced AL Model Generation and Evaluation

agent data and test data are converted into numerical vectors based on pre-trained word embeddings based on the BERT-BASE model.

Algorithm 1: Guided LDA based seed instances selection for the base model

Input: The unlabeled review sentences R, a basic set of 15 aspect seed words uniformly distributed across five aspect categories (S), the count of required seed instances (N)
Output: The minimal set of seed instances Rseed
1. Extract only the nouns/noun phrases from each review sentence in R and hence represent each review sentence as a list; the output being IP
2. Expand the basic set of 15 aspect seed words S using BERT based semantic similarity; the output being Sexp
3. The inputs IP and Sexp are fed to the Guided LDA algorithm
4. Estimate the probability of each review sentence belonging to each aspect category
5. Select the top most probable review sentences from each category to contribute to a total of N instances which forms Rseed
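Steps 2 and 5 of Algorithm 1 can be sketched in a few lines. This is a hedged illustration only: the toy 3-dimensional vectors stand in for BERT embeddings, the hand-written probabilities stand in for the Guided LDA output of steps 3 and 4, the equal share per category is an assumption made here, and the function names are hypothetical. The 0.7 threshold is the similarity cut-off stated in Section 3.1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, vectors, threshold=0.7):
    """Step 2: add every vocabulary token whose similarity to any seed word
    of a category reaches the threshold."""
    expanded = {}
    for category, words in seeds.items():
        known = [vectors[w] for w in words if w in vectors]
        similar = {tok for tok, vec in vectors.items()
                   if any(cosine(vec, s) >= threshold for s in known)}
        expanded[category] = set(words) | similar
    return expanded


def select_seed_instances(topic_probs, n_total):
    """Step 5: pick the most probable sentences per category, N in total.
    Assumes an equal share per category (n_total // number of categories)."""
    categories = sorted({c for probs in topic_probs.values() for c in probs})
    share = n_total // len(categories)
    chosen = []
    for category in categories:
        ranked = sorted(topic_probs,
                        key=lambda s: topic_probs[s].get(category, 0.0),
                        reverse=True)
        chosen.extend([s for s in ranked if s not in chosen][:share])
    return chosen


# Toy stand-ins (hypothetical): tiny vectors instead of BERT embeddings,
# hand-written probabilities instead of Guided LDA output.
toy_vectors = {"food": (1.0, 0.0, 0.0), "pizza": (0.9, 0.1, 0.0),
               "price": (0.0, 1.0, 0.0), "bill": (0.1, 0.9, 0.0),
               "waiter": (0.0, 0.0, 1.0)}
toy_seeds = {"food": ["food"], "pricing": ["price"]}
toy_probs = {"r1": {"food": 0.9, "pricing": 0.1},
             "r2": {"food": 0.2, "pricing": 0.8},
             "r3": {"food": 0.6, "pricing": 0.4},
             "r4": {"food": 0.3, "pricing": 0.7}}
print(expand_seeds(toy_seeds, toy_vectors))
print(select_seed_instances(toy_probs, n_total=2))
```

In a real pipeline the vectors would come from a pre-trained BERT model and the per-category probabilities from a guided LDA implementation; both are deliberately left out so that the sketch stays self-contained.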

3.1. Pre-Processing and data segregation

All the unlabeled review instances are subjected to pre-processing, which includes removing punctuation, replacing contractions (I’m with I am, she’ll with she will etc.) and tokenization. Four disjoint datasets (seed data, agent data, support data and test data) that are utilized in the different phases of model building and evaluation have to be segregated from these pre-processed review instances. Seed data (Input 1) is a minimal set of instances chosen from the review instances using an unsupervised guided LDA approach as shown in Fig. 1. This seed data is used to train the base model for aspect term extraction, designed as a sequence labelling model. The unsupervised smart seeding strategy explained in Algorithm 1 ensures that the review instances chosen are qualitative for the task of aspect term extraction and ensures a uniform coverage across each aspect category. Aspect category refers to the broad category of the aspect terms (food, ambience, pricing etc.) as seen in the example portrayed in Fig. 2. The Latent Dirichlet Allocation (LDA) (Jagarlamudi, Daume & Udupa, 2012) technique forms the pillar of Algorithm 1. The unlabeled review instances and an initial aspect seed set of manually chosen minimal aspect seed words representative of each aspect category form the inputs to the algorithm. A total of 15 aspect terms corresponding to five broad aspect categories, derived from generic domain knowledge, form the seed aspect terms. The unlabeled instances are converted into a list of nouns and noun phrases. Further, the initial aspect seed set is expanded by appending tokens from the input vocabulary which have a BERT based semantic similarity of at least 0.7 with any of the seed words. The expanded seed set Sexp facilitates better topic convergence in the Guided LDA algorithm. The values for the guided LDA hyperparameters α and β, which influence the review-topic and word-topic distributions, the number of topics K, and seed-confidence, which is the reliability factor of the aspect seed words, are set for the Guided LDA algorithm, which finally outputs the probability of each review sentence belonging to each aspect category. The review sentences with higher probability values are extracted from each aspect category to form the seed instances. The idea behind seed instance selection is derived from our prior work (Venugopalan & Gupta, 2022). The seed instances thus retrieved (Rseed) are manually labelled and input as seed data (Input 1) to the base model. The seed instances are labelled using the BIO labelling notation, where the beginning word of the aspect term is marked B, any following words in the aspect term I and non-aspect terms O. Fig. 2 shows a glimpse of how review sentences are converted from standard labelling formats to BIO labelling notations for sequence labelling models.
After separating the seed data from the review instances, agent data, support data and test data are segregated from the remaining instances as shown in Fig. 1. Agent data (Input 2) is a sufficiently sized set of unlabeled instances for the RL agent training. Support data (Input 3) is a set of unlabeled instances from which the trained agent extracts the most informative samples. Finally, test data are the samples used to evaluate the reinforced active learning model. Further, seed data, support data,

3.2. Base model Generation for sequential aspect labelling

A BiLSTM-CRF based deep neural network (Huang, Xu & Yu, 2015; Miao, Cheng, Ji, Zhang, & Kong 2021; Mir & Mahmood, 2020; Kumar, Verma & Sharan, 2021b; Gandhi & Attar, 2020), a model family that has proven effective for sequential text labelling applications, is the choice for the base model. This base model is trained using the minimal labeled seed instances, seed data (Input 1). The base model architecture is depicted in Fig. 3. The input to the BiLSTM layer is given in the form of word embeddings of the seed data (Input 1) based on the BERT-BASE (Devlin, Chang, Lee & Toutanova, 2019) model with a vector size of 768. Hence, there are 768 nodes in the input layer and 450 hidden units in the BiLSTM layer. As observed in Fig. 3, the output generated from the BiLSTM layer is processed by the CRF layer with 3 output nodes corresponding to the final BIO labels/tags. The training is done over 20 iterations and the loss is calculated using a negative log likelihood function as the difference between the gold score calculated from the actual labels and the predicted score. The loss is stacked after each sequence of data and back propagated after each prediction. The final state of the base model, which is the weight matrix of the BiLSTM layer at the end of the final iteration, is saved and input to the RL agent as its initial state as observed in Fig. 1.

3.3. Sample selection using reinforced active learning

The active learning strategy implemented in the proposed model is based on reinforcement learning (Sutton & Barto, 2018). The Actor-Critic model³, which combines the advantages of both value based and policy based algorithms, is the chosen model for the Reinforcement Learning (RL). The drawbacks of prior approaches such as Monte Carlo methods (Wang, Won, Hsu, & Lee, 2012), where the strategy is updated only at the end of each episode and the estimates suffer from high variance, are resolved using the Actor-Critic model. The Actor-Critic model employs two different components: the Actor, which uses and implements a policy to take an action, and the Critic, which judges the action in terms of rewards. The actor and the critic work in conjunction to form an agent. In due course of training, the agent is expected to learn the task of choosing informative review instances with respect to sequential aspect labelling. The Actor-Critic model for agent training and its implementation architecture are explained in detail in Section 3.3.1. Section 3.3.2 depicts the process by which the trained agent extracts the optimal sample.

3.3.1. Actor-Critic model using Self attention RNN for agent training
The actor-critic model for agent training is depicted in Fig. 4. The

³ Understanding Actor Critic Methods and A2C, https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f, 2019/02/16.

M. Venugopalan and D. Gupta Expert Systems With Applications 209 (2022) 118228

Fig. 1. Proposed Architecture for Reinforced Active Learning for Aspect Term Extraction.


Fig. 2. BIO labelling generated from standard HTML formats for aspect term extraction.
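The BIO notation described in Section 3.1 can be illustrated with a small sketch. It assumes that each review sentence is already tokenized and that its annotated aspect terms are available as token lists; the function name `bio_tags` is hypothetical, and parsing of the original annotation format is deliberately left out.

```python
def bio_tags(tokens, aspect_terms):
    """Tag every token as B (first word of an aspect term), I (inside) or O (outside)."""
    tags = ["O"] * len(tokens)
    for term in aspect_terms:
        if not term:
            continue  # ignore empty annotations
        n = len(term)
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if tokens[start:start + n] == term and window_free:
                tags[start:start + n] = ["B"] + ["I"] * (n - 1)
    return tags


tokens = "the battery life is great but the screen is dim".split()
print(list(zip(tokens, bio_tags(tokens, [["battery", "life"], ["screen"]]))))
```

A multi-word aspect term such as "battery life" therefore yields one B tag followed by I tags, which is what allows a sequence labelling model to recover complete aspect terms.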

Fig. 3. Base Model Architecture (BERT + BiLSTM + CRF) for Sequential Aspect Labelling.

actor-critic model is represented as a tuple (S, A, R) where S represents the current state of the agent, A the possible set of actions and R the calculated reward. The actor-critic agent is implemented using a Self-Attention RNN (Pal et al., 2020) model, which is capable of learning long sequences by allowing the inputs to interact with each other. The Self-Attention RNN architecture implemented for agent training is showcased in Fig. 5. As observed in Fig. 5, the implemented architecture for the actor-critic agent employs a GRU with one hidden layer containing 450 hidden units. This is followed by layers for linear compression, fully connected layers and finally linear compression layers with softmax activation to derive output probabilities.
The unlabelled instances termed as agent data (Input 2) are input to the agent for training as shown in Fig. 4. The word embedding representations of the agent data are given as the input to the agent. The training starts with an empty bag in the environment. The agent is initialised using the final state of the base model (the weight matrix of the BiLSTM layer at the end of the final iteration) as observed in Fig. 1. This serves as a starting point for the agent to orient itself towards the specific task of aspect term extraction. An instance is randomly chosen from the agent data. As it traverses through a forward pass of the architecture shown in Fig. 5, two different outputs are drawn. The first output is the action probabilities, a one dimensional vector of size three, where each value in the vector signifies the probability of the corresponding action being taken by the agent. The three different actions that the agent could adopt are adding an instance to the bag, discarding an instance and ending the episode by reinitialising the network to the seed state. The second output is the state value derived from the action probabilities through linear compression. The next action to be taken is calculated from these probabilities and the environment calculates the reward based on the action and the current state as illustrated in Fig. 4.
The reward function that calculates the reward based on the current state and the action taken is the most crucial part of the RL agent. The decision to retain or discard a particular review instance depends on the value produced by the reward function. The reward function is based on a maximization algorithm, where a higher value of reward provides a higher chance for the review instance to be retained in the bag. The


Fig. 4. Actor Critic Model for Agent Training.

Fig. 5. Actor Critic Architecture for Agent Training.
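The two outputs of the agent (action probabilities and state value) and the end-of-episode loss can be sketched in plain Python. The text does not give the loss formula, so the advantage based actor and critic terms below follow the generic A2C formulation and are an assumption, as are the discount factor `gamma` and all function names.

```python
import math
import random

ACTIONS = ("add instance", "discard instance", "end episode")


def softmax(logits):
    """Turn three raw scores into action probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw the index of the next action according to the action probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for every step of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def a2c_loss(log_probs, values, rewards, gamma=0.99):
    """Assumed episode loss: policy-gradient term plus squared value error."""
    returns = discounted_returns(rewards, gamma)
    actor = sum(-lp * (g - v) for lp, g, v in zip(log_probs, returns, values))
    critic = sum((g - v) ** 2 for g, v in zip(returns, values))
    return actor + critic


probs = softmax([0.2, 1.5, -0.3])
print(ACTIONS[sample_action(probs)])
```

In the described setup the logits and state values would come from the Self-Attention RNN; here they are supplied by hand so that the sketch stays self-contained.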

reward is influenced by four parameters: the average distance between each instance in the bag in terms of their semantic similarity, the average distance between each instance in the bag in terms of their parse tree similarity, a penalty against the size of the bag and a parameter called bonus which is based on the current action taken and the number of actions taken before. The semantic distance is calculated using the Word Mover Distance method (WMD) (Kusner, Sun, Kolkin & Weinberger, 2015), which expresses the dissimilarity between two review instances as the minimum distance that the embedded words of one instance need to traverse to reach the embedded words of another review instance. If |bag| denotes the number of instances in the bag at a time instant t, the average word movers distance of the bag at time instant t is calculated using Equation (1).

$$\mathrm{AvgWMD}_t = \begin{cases} \dfrac{\sum_{i=1}^{|bag|-1} \sum_{j=i+1}^{|bag|} \mathrm{WMD}(r_i, r_j)}{C(|bag|, 2)}, & \text{if } |bag| \neq 0 \\ -1, & \text{otherwise} \end{cases} \tag{1}$$

The parse tree distance, which is a measure of structural dissimilarity of the review sentences, is calculated using the tree kernel concept (Moschitti, 2006). This measure, labelled as Tree Kernel Distance (TKD), is calculated using the subtree kernel concept in terms of the number of common subtrees between the parse trees of both the review sentences. The average tree kernel distance of the review instances in the bag at a time t is calculated using Equation (2).

$$\mathrm{AvgTKD}_t = \begin{cases} \dfrac{\sum_{i=1}^{|bag|-1} \sum_{j=i+1}^{|bag|} \mathrm{TKD}(r_i, r_j)}{C(|bag|, 2)}, & \text{if } |bag| \neq 0 \\ -1, & \text{otherwise} \end{cases} \tag{2}$$

The third parameter that influences the reward is the penalty against the size of the bag, which is calculated using Equation (3) where mid is a base value set for the bag size which controls the size of the bag during agent training.

$$\mathrm{Penalty}_t = \frac{4}{1 + e^{-\frac{(mid - |bag|)^2}{mid}}} - 1 \tag{3}$$

A parameter called bonus is calculated in accordance with the action taken by the agent as sketched by Equation (4), where N is the number of actions taken by the agent till time t. Finally, the reward of the RL environment is calculated from all these parameters as expressed in Equation (5).
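Equations (1) to (5) can be combined into a small reward sketch. Several points are assumptions: the WMD and TKD functions are passed in as callables (a toy absolute-difference distance is used in the example), the penalty follows one reading of the typeset Equation (3), and the average is set to -1 whenever the bag holds fewer than two instances so that C(|bag|, 2) is never zero. All function names are hypothetical.

```python
import math
from itertools import combinations


def avg_pairwise(bag, distance):
    """Equations (1) and (2): mean pairwise distance over the bag, -1 if undefined."""
    pairs = list(combinations(bag, 2))
    if not pairs:  # assumption: also covers a bag with a single instance
        return -1.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def penalty(bag_size, mid=50):
    """Equation (3) as read here: smallest when the bag size equals mid."""
    return 4.0 / (1.0 + math.exp(-((mid - bag_size) ** 2) / mid)) - 1.0


def bonus(action, n_actions):
    """Equation (4): positive for adding (action 0), negative otherwise, decaying with N."""
    sign = 1.0 if action == 0 else -1.0
    return sign * math.exp(-n_actions / 100.0)


def reward(bag, action, n_actions, wmd, tkd, mid=50):
    """Equation (5): AvgWMD + AvgTKD + Bonus - Penalty."""
    return (avg_pairwise(bag, wmd) + avg_pairwise(bag, tkd)
            + bonus(action, n_actions) - penalty(len(bag), mid))


def toy_distance(a, b):
    """Stand-in for WMD or TKD on numeric toy instances."""
    return abs(a - b)


print(reward([1, 2, 4], action=0, n_actions=0, wmd=toy_distance, tkd=toy_distance, mid=3))
```

With this reading, the penalty takes its minimum value of 1 when the bag size equals mid and approaches 3 as the bag size moves away from mid, which is consistent with mid acting as a target bag size.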




$$\mathrm{Bonus}_t = \begin{cases} 1 \times e^{-N/100}, & \text{if action} = 0 \text{ (adding an instance)} \\ -1 \times e^{-N/100}, & \text{if action} = 1 \text{ (discarding an instance)} \\ -1 \times e^{-N/100}, & \text{if action} = 2 \text{ (end episode)} \end{cases} \tag{4}$$

$$\mathrm{Reward}_t = \mathrm{AvgWMD}_t + \mathrm{AvgTKD}_t + \mathrm{Bonus}_t - \mathrm{Penalty}_t \tag{5}$$

As observed, the reward is positively affected by larger semantic distances, as the goal is to find a set of instances that are semantically far away from each other. This applies to syntactic distance as well, where the measure considered is the parse tree distance which measures the structural dissimilarity of the review instances. The size of the bag has a negative impact on the reward as the goal is to sample a minimal subset. The reward function is thus designed to segregate review instances that are semantically distant from each other, varied in sentence structure in terms of parse trees and of an optimal size to incorporate sufficient information.
Once the action is taken by the agent based on the action probabilities, the reward is computed using Equation (5). The action brings about a change in the environment and hence the state value is recomputed. The reward is sent to the critic component as seen in Fig. 4 to correct and update its estimates and hence update the actor component. The updated state value is sent both to the actor component, which utilises it for taking the next action, and the critic component, which uses it for value estimation. This makes the actor-critic more competent by continuously learning from updates rather than only at the end of the episode. This process iterates through randomly chosen instances from agent data until the action chosen by the agent is to end the episode. At the end of each episode, the loss is determined from the state value, action and reward as shown in Fig. 4 and the loss is back-propagated. The bag is emptied and the agent is reinitialised to the seed state of the base model. The process is repeated for a sufficient number of epochs to improve the agent learning. Thus, the agent learns over a period to include only those review instances significant with respect to the sequence labelling application of aspect term extraction. The whole sequence of processes that happens during agent training is sketched briefly in Algorithm 2.

Algorithm 2: Agent Training

Input: Final state of base model, Agent Data (Input 2) as word embedding vectors
Output: Trained agent
1. Initialize agent state using the final state of the base model
2. Randomly choose an instance from Agent Data and input to the agent
3. Get outputs (action probability and state value)
4. Calculate current action and reward (action could be adding instance to bag, discarding or end episode)
5. Repeat steps 2 – 4 until action = end episode
6. Calculate loss value
7. Back propagate loss value
8. Reset the agent hidden state and the bag
9. Repeat steps 1 – 8 over K iterations

The active learning strategy implemented here is a multiple instance active learning strategy, as the decision regarding the informativeness of a review instance with respect to the labelling task is not an individual decision but a collective decision over all the instances in the extracted sample/bag at that point of time, as is clear from Equations (1)-(5) related to the reward function. The decision making triggered by the reward function is based on a first order Markov Model assumption. This implies that in the course of appending instances to the bag, the decision regarding the inclusion of the next instance is based purely on the current state of the bag. A new instance would be included in the bag only if adding it triggers an increase in the reward. If the inclusion would trigger a reward dip, the corresponding instance is discarded. The agent trained over the Agent data (Input 2) is now capable of extracting an optimal sample for sequential aspect labelling from unlabelled review instances, which is explained in Section 3.3.2.

3.3.2. Sample subset extraction by the trained agent
The unlabelled set from which the most informative samples have to be extracted by the agent after the training phase has been depicted as Support Data (Input 3) in Fig. 1. On completion of agent training, the support data is input to the agent, from which it extracts the minimalistic and most informative sample subset. During this process the agent is made to iterate through each of the instances in the support data in a sequential manner to ensure a fair chance of participation for every instance. The three actions taken by the agent during this sample extraction phase are adding an instance, discarding an instance or starting a fresh bag after saving the current bag. Finally, when the agent has iterated through all the instances in the support data, the instances chosen into the bag form the extracted optimal sample set. The sequence of steps which happen in the sample subset extraction phase is depicted briefly in Algorithm 3.

Algorithm 3: Sample subset extraction by Trained Agent

Input: Support Data (Input 3) as word embedding vectors, Trained agent
Output: Optimal sample
1. Pass an instance in the support data to the agent
2. Get output (action probabilities and state value)
3. Calculate current action and reward (action could be adding instance to bag, discarding or start a fresh bag after saving the current bag)
4. Repeat steps 1 – 3 over all instances in Support data sequentially
5. Combine all the instances in the saved bags to form the final bag which is the optimal sample

3.4. Reinforced AL model Generation and evaluation

The minimalistic sample selected by the active learning model from the Support Data (Input 3) is believed to be representative of the entire support data. The instances in this optimal subset are manually labelled, which forms Input 4 as depicted in Fig. 1, and are appended to the seed instances (Input 1) used to train the base model. This combined data, which is believed to be representative of the entire training data, is used to train the same classifier used as the base model to develop a reinforced AL model. The trustworthiness of the reinforced AL model is assessed by comparison with another version of the base model classifier trained on an equal number of instances which are randomly chosen from the support data. The reinforced AL classifier is also compared against the base model trained on the entire training data.

4. Experimental setup

4.1. Datasets

The proposed reinforced active learning model for optimal sampling has been experimented on the text labelling application of aspect term extraction in sentiment analysis. The SemEval 2014–2016 datasets have been utilised for the experiments. The data contains review sentences from the restaurant and laptop domains. Each review sentence is annotated with aspect terms, which are mapped into a BIO labelling as presented in Fig. 2. The results have been reported on four datasets: Restaurant-2014, Restaurant-2015, Restaurant-2016 and Laptop-2014. The review sentences in the training data of each dataset have been utilised to create the seed and the support data requirements in the proposed model. A certain proportion of the training data is used as the seed data. The remaining instances in the training dataset (without labels) after the choice of seed data become the support data. Experiments varying the seed data proportion from 10% to 1% have been conducted. Table 2 depicts the data segregation strategy on the chosen datasets if the seed model consumes 3% of the training data. In a similar manner, experiments varying the seed data proportions through 10%, 7%, 5%, 3% and 1% of the training data have been conducted, where the support data proportions would be 90%, 93%, 95%, 97% and 99% respectively. The


unlabelled training instances of other datasets from the same domain are utilised as agent data. For example, training data from the 2015 and 2016 restaurant datasets together contribute the agent data for the Restaurant-2014 dataset and so on. This ensures that agent data is disjoint from the seed data and support data. For the Laptop-2014 dataset, the unlabelled training instances from the 2015 and 2016 datasets of the laptop domain served as agent training data. The laptop datasets from 2015 and 2016 were not labelled for the aspect term extraction task and hence experiments were restricted to the Laptop-2014 dataset. In Table 2, the seed, support and agent data are mapped to the labels Input 1, Input 3 and Input 2 respectively as referred to in Fig. 1. The proposed RAL model extracts the optimal sample Input 4 from Support Data (Input 3).

Table 2
Data segregation statistics of SemEval datasets for proposed model experiments (the statistics in the table show a data segregation instance when the seed data is chosen as 3% of the train data).

Dataset          Complete    Seed Data, Input 1      Support Data, Input 3         Agent Data, Input 2   Test
                 Train Data  (Labelled)              (Unlabelled)                  (Unlabelled)          Data
                             (3% of Complete         (Training Data - Seed Data)
                             Train Data)             (97% of Complete Train Data)
Restaurant-2014  3041        91                      2950                          3315                  800
Restaurant-2015  1315        40                      1275                          5041                  685
Restaurant-2016  2000        60                      1940                          4356                  676
Laptop-2014      3045        91                      2954                          4239                  800

4.2. Parameter settings and evaluation metrics

Stanford Core NLP V4.0.0 was used for text pre-processing tasks. With respect to the guided LDA parameters used for seed instance selection, the value of K, the number of topics, is set to 5. This is in accordance with the broad aspect categories Food, Ambience, Service, Cost and General exhibited across all the SemEval datasets. The hyperparameters of the guided LDA model α and β, which influence the review-topic and topic-word distributions respectively, are set to the default values of 0.1 and 0.01 suitable for the task under consideration. A low value of α favours a review sentence containing a few or even only one of the aspect categories. Similarly, a low value of β favours a skewed word-topic distribution. The Guided LDA algorithm also takes the seed-confidence value as an input parameter, which expresses the bias towards the seed set. This value has been set to 0.8 as the expanded seed set is generated through a semantic similarity based iterative algorithm which ensures the reliability of the aspect seed set. The parameters for guided LDA are listed in Table 3.
The BERT-BASE word embeddings from the pre-defined BERT model of vector dimension 768 were provided as inputs to the base model as well as the actor critic models. PyTorch-Transformers were used to generate BERT embeddings. The Stochastic Gradient Descent (SGD) optimiser from PyTorch has been used to optimise the base model, i.e. the BiLSTM-CRF network. The agent network, the Self-attention RNN, has been optimized using the Adam optimiser from PyTorch. During agent training, one of the parameters that influence the reward function is the penalty, which is parameterized by mid, a factor controlling the bag size as depicted in Equation (3). The value of mid is calibrated to 50 based on trial and error experiments, which contributed to an appropriate agent learning with respect to the bag size. Since the sample selection at each episode is random during agent training, the gradient of the model stabilizes only over epochs. Hence, the agent is trained over 5000 epochs to reduce the loss incurred. The parameter settings of the BiLSTM-CRF network used as the base model and the Self-attention RNN network for agent training are given in Table 4 and Table 5 respectively.

Table 3
Parameter settings of the Guided LDA Algorithm.

Parameter                                           Value
No of Topics - K                                    5
Hyperparameter for Review-Topic Distribution - α    0.1
Hyperparameter for Word-Topic Distribution - β      0.01
Seed-confidence                                     0.8

Table 4
Parameter settings of the Bi-LSTM CRF network used as base model.

Parameter                       Value
Dimension of word embeddings    768
Learning Rate                   0.01
Weight Decay                    0.0001
Hidden Size                     450
Training Epochs                 20

The evaluation metrics used for evaluating the proposed model and its comparison with other models are the F measure of aspect terms, Precision and Recall. The proposed model is also subjected to an attribute based evaluation. Instead of reporting the performance on the entire test data, disjoint sets from the test bed which represent varied degrees of complexity corresponding to attributes like aspect term length, label consistency and out of vocabulary beds are created, and the performance is reported as the F-measure on these test data partitions separately.

5. Results and analysis

The presentation of results and analysis includes ablation studies focused on assessing the strength of the Guided LDA module for the smart seed instance selection strategy, decision making for the volume of seed input and assessing the strength of the reinforced active learning model for sample selection. The proposed model performance is then compared to the state of the art results, followed by an attribute based fine grained analysis. The highlights of the section can be summarized as

• The optimal seed input to the proposed reinforced active learning (RAL) model and hence the total training data requirement of the proposed RAL model is derived in Section 5.1. The strength of the smart seeding strategy using Guided LDA is also ascertained by comparing it with seed instances chosen using random sampling.
• The proposed model is evaluated in terms of its comparison with results of random sampling and the model trained on complete train data in Section 5.2.
• A comparison of the proposed RAL model with recent supervised baselines for aspect term extraction in Section 5.3.
• A fine-grained performance analysis of the proposed RAL model based on attribute based evaluation in Section 5.4.

5.1. Optimal seed set selection strategy for the proposed RAL model

The seed instances that need to be input to the base model (BERT + BiLSTM + CRF) are chosen using a smart seed instance selection strategy, the unsupervised Guided LDA approach described in Algorithm 1. A fixed percentage of the train data is input as seed instances (Input 1) to the base model and its final state serves as the starting point for agent training. The trained agent extracts the most informative sample from the support data (Input 3). The base model trained on this most


informative sample subjected to labelling (Input 4) coupled with the labelled seed data instances (Input 1) is the proposed reinforced active learning model for aspect term extraction. The decision regarding the optimal percentage of seed instances to be input to the base model is taken by performing a set of experiments varying the seed set proportion through 10%, 7%, 5%, 3% and 1% of the training data. A set of experiments was also conducted in parallel using similar proportions of seed set inputs chosen randomly from the training data. This is performed to assess the strength of the smart seeding strategy.

Table 5
Parameter settings of the Self-Attention RNN network for agent training.

Parameter                       Value
Dimension of word embeddings    768
Learning Rate                   0.0001
Hidden Size                     450
Training Epochs                 5000

The observations from these experiments are sketched in Fig. 6, Fig. 7 and Fig. 8 respectively. Fig. 6 displays the reported performance of the proposed reinforced active learning model in terms of the micro averaged F-measure of all the classes B, I and O. As observed, the performance of the proposed reinforced active learning model improved when seed proportions were varied down from 10% to 3% but showed a dip when the sample size was further reduced to 1%. This observation holds good for all datasets from the restaurant domain, but for the laptop domain a gradual dip in F-measure was observed with seed instances reduced to 7%, 5% and 3% and a major dip in performance when the seed sample is further reduced to 1%. Fig. 7 displays the results of the proposed approach without the smart seeding strategy, evaluated in terms of the micro averaged F-measure of the BIO classes, that is, the performance of the proposed model when randomly chosen seed instances replaced the instances chosen through the smart seeding strategy. The performance dip across all seed proportions visible in Fig. 7 when compared to Fig. 6 clearly indicates the strength of the smart seeding strategy using Guided LDA.
Fig. 8 depicts the proposed model performance reported in terms of the F-measure of the aspect terms, which is the standard evaluation metric with respect to aspect term extraction. As observed, the performance of the proposed reinforced active learning model improved when seed proportions were varied down from 10% to 7% but showed a dip when the sample size was further reduced to 5%, 3% and 1%. This observation holds good for all datasets from the restaurant domain, but for the laptop domain a gradual dip in F-measure was observed with seed instances reduced to 7%, 5%, 3% and 1%. Based on these observations, a conclusion was arrived at regarding the proportion of unlabelled instances to be input to the base model which guarantees an appreciable performance of the reinforced active learning model.
The conclusion from the observations based on Fig. 6 and Fig. 8 is that, in scenarios where BIO based F-measure evaluation makes sense, the model that takes 3% seed data instances is the best model, and with respect to aspect term F-measure based evaluation the model that takes 7% seed data instances performs the best. These are chosen as the proposed reinforced active learning models for the two scenarios and are referred to as proposed RAL Model-1 and proposed RAL Model-2 respectively. The training data requirement of these models needs to be investigated, as the goal of the proposed model is to extract a minimal optimal sample for training which is almost representative of the entire training data.
The training data requirement and hence the data reduction achieved in the training data by the proposed RAL models is depicted in Table 6 and Table 7 respectively. As observed in Table 6, the proposed RAL Model-1 utilises at most 7.3% of the training data to achieve the results reported in Fig. 6. Thus the proposed RAL Model-1 requires only 3% seed data instances to initiate the base model and only about 4 to 7% of the training data in labelled form for training the model. The proposed RAL Model-2 requires only 7% of seed data instances to initiate the base model and the final labelled data requirement is only about 9 to 13% of the training data across the datasets experimented, as observed in Table 7. The total number of instances required for training the proposed RAL models is presented in column 6 of Table 6 and Table 7 respectively. These are the only instances that need to be labelled for training the proposed models.
Thus the proposed RAL models are:

• RAL Model-1, which takes 3% seed data input for the base model training; the reinforced active learning model is the base model trained on seed instances + sample extracted by the learned agent from support data. This model is optimal for scenarios where evaluation based on the micro averaged F-measure of the B, I and O classes is preferred.
• RAL Model-2, which takes 7% seed data input for the base model training; the reinforced active learning model is the base model trained on seed instances + sample extracted by the learned agent from support data. This model is optimal for scenarios where evaluation based on the F-measure of aspect terms is preferred. This evaluation metric credits the capability of the proposed model to predict the aspect terms completely and is not influenced by the more frequent ‘O’ class.

Fig. 6. RAL model performance with variations in seed proportions- metric being micro averaged F measure of BIO classes.


Fig. 7. RAL model performance excluding smart seeding strategy with variations in seed proportions- metric being micro averaged F measure of BIO classes.

Fig. 8. RAL model performance with variations in seed proportions- metric being F measure of aspect terms.
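The difference between the aspect term F-measure and a tag-level score can be made concrete with a short sketch. It assumes exact-match scoring, where a predicted aspect term counts as correct only if both its start and end positions agree with a gold term; the function names `spans` and `aspect_prf` are hypothetical and the exact matching rule used in the experiments is not stated in the text.

```python
def spans(tags):
    """Return the set of (start, end) aspect-term spans in one BIO tag sequence."""
    found, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and start is not None:
            found.add((start, i))
            start = None
        if tag == "B":
            start = i
    return found


def aspect_prf(gold_sequences, pred_sequences):
    """Precision, recall and F-measure over complete aspect terms."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = spans(gold), spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = [["O", "B", "I", "O", "B"]]
pred = [["O", "B", "I", "O", "O"]]
print(aspect_prf(gold, pred))
```

In this example four of the five tags are predicted correctly, yet only one of the two aspect terms is recovered, which is why a term-level metric is not inflated by the frequent O class.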

Table 6
Training data segregation for the proposed RAL Model-1.

Dataset      Complete    Seed set   Support Data  Sample extracted by Agent  Total sample used for training  % of Train data utilized
             Train data  (Input 1)  (Input 3)     from Support Data          by Proposed RAL Model-1         by proposed RAL Model-1
                         Labelled   Unlabelled    (Input 4) Unlabelled       (Input 1 + Input 4) Labelled
Rest-2014    3041        91         2950          68                         159                             5.23
Rest-2015    1315        40         1275          15                         55                              4.18
Rest-2016    2000        60         1940          86                         146                             7.30
Laptop-2014  3045        91         2954          81                         172                             5.65
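The last two columns of Table 6 follow from simple arithmetic: the labelled total is the seed set plus the sample extracted by the agent, and the percentage is that total relative to the complete train data. The short check below uses the values from Table 6; the dictionary layout and the function name `labelled_share` are only illustrative.

```python
# (complete train data, seed set, sample extracted by agent) per dataset, from Table 6
TABLE_6 = {
    "Rest-2014": (3041, 91, 68),
    "Rest-2015": (1315, 40, 15),
    "Rest-2016": (2000, 60, 86),
    "Laptop-2014": (3045, 91, 81),
}


def labelled_share(complete, seed, extracted):
    """Return the labelled total and its percentage of the complete train data."""
    total = seed + extracted
    return total, round(100.0 * total / complete, 2)


for name, row in TABLE_6.items():
    total, percent = labelled_share(*row)
    print(f"{name}: {total} labelled instances, {percent}% of train data")
```

The same computation applied to the Model-2 figures reproduces the percentages reported for that model.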

5.2. Proposed RAL model evaluation

The proposed RAL models are evaluated by comparing them with:

• Base Model on Complete Train Data: The base model (BERT + BiLSTM + CRF) trained on the entire training data.
• Base Model on Random Sample: The base model (BERT + BiLSTM + CRF) is trained on randomly chosen instances from the training data, with a sample size equal to that of the data utilised by the proposed RAL models for training (# instances in Input 1 + # instances in Input 4). The experiment is repeated ten times varying the random sample chosen and the average results are reported. This ensures that the results are not biased to any specific sample chosen.

The results of these comparisons are reported for proposed RAL Model-1 and RAL Model-2 in Fig. 9 and Fig. 10 respectively in terms of their F-measure on the test data. As observed from Fig. 9, the proposed RAL Model-1 has performed significantly better than random sampling, with a visible improvement of around 5 to 9 points in the micro-averaged F-measure across datasets. This ascertains the success of the model in choosing a sample which assures a performance better than random sampling. It can be observed that the proposed model is almost


on par with the model trained on the complete train data. With a training data requirement of hardly 4 to 7% of the complete train data (as observed in Table 6), the proposed model is able to achieve on par results, with a performance dip of hardly 1.5 to 3.2 points in F-measure relative to the model trained on the complete train data.

Fig. 9. Proposed RAL Model-1 comparison with complete train and random sampling results.

Similarly, the proposed RAL Model-2, which is meant for evaluation of aspect term extraction models, is compared with the results of training on the complete training data and the results of random sampling, as portrayed in Fig. 10. The performance gap in terms of F-measure of aspect terms between the proposed RAL Model-2, which is trained on the minimal sample set, and the complete train data model is hardly 2 to 5 points. As observed in Table 7, the labelled data requirement of the proposed RAL Model-2 is only 9% to 13% of the entire train data. It is also observed from Fig. 10 that RAL Model-2 outperforms random sampling by 9 to 17 points in F-measure across the datasets, which supports the strength of the model.

Table 7
Training data segregation for the proposed RAL Model-2.

| Dataset | Complete train data | Seed set (Input 1), labelled | Support data (Input 3), unlabelled | Sample extracted by agent from support data (Input 4), unlabelled | Total sample used for training by proposed RAL Model-2 (Input 1 + Input 4), labelled | % of train data utilized by proposed RAL Model-2 |
|---|---|---|---|---|---|---|
| Rest-2014 | 3041 | 213 | 2828 | 64 | 277 | 9.11 |
| Rest-2015 | 1315 | 92 | 1223 | 23 | 115 | 8.75 |
| Rest-2016 | 2000 | 140 | 1860 | 119 | 259 | 12.95 |
| Laptop-2014 | 3045 | 213 | 2832 | 105 | 318 | 10.44 |

Fig. 10. Proposed RAL Model-2 comparison with complete train and random sampling results.

The proposed RAL Model-2 is further compared with the complete train model in terms of Precision and Recall of the aspect terms extracted in Fig. 11. It can be observed that the proposed model is almost balanced in precision and recall, with recall marginally on the higher side across most of the datasets reported. This behaviour is similar to that of the model trained on the complete train data. Rest-2016 is the only dataset where the proposed RAL Model-2 has reported a noticeable variation between precision and recall, at 70.01 and 77.61 respectively. Surprisingly, the model trained on the complete train data has also reported comparable results on the same dataset.

The learning assimilated from the comparative study is that if a well-defined strategy can extract a minimal and most informative sample from a large data pool, a model trained on that minimal subset is capable of achieving almost on par results with the model trained on larger samples and definitely outperforms models trained on random samples.

5.3. Comparison with recent supervised baselines for aspect term extraction

The proposed work created a reinforced active learning framework for optimal sampling where the base model chosen is a BERT + BiLSTM + CRF, which guarantees appreciable performance in the field of aspect term extraction despite being a simple deep learning model. The proposed RAL Model-2, which is trained on a minimal informative sample amounting to at most about 13% of the training data, is compared in Table 8 against a few recent supervised baselines from the past five years that were trained on the complete training data and have reported their results on SemEval datasets. The objective of this illustration is not a direct comparison with the state of the art; rather, if the proposed model trained on a minimal sample is able to perform close to the state of the art, that would attest to the robustness of the optimal sampling strategy. The comparison is made in terms of the F-measure of the extracted aspect terms.

• Chen, Wang, Zhu and Liu (2022) proposed a densely connected self-attention Bi-LSTM neural network model which takes inputs in the form of general word embeddings generated using the GloVe model and domain-specific embeddings generated using fastText, and attempts to preserve the feature information across layers. Out-of-vocabulary word embeddings have been generated using fastText.
• Kumar et al. (2021a) proposed a hierarchical self-attention Bi-LSTM-CRF network which requires less memory and takes input in the form of double embeddings: domain-specific and pre-trained GloVe word embeddings. Self-attention incorporated at multiple levels helps identify the most prominent tokens with respect to the entire meaning of the sentence and explores interdependency among words to identify co-located tokens.
• Yang, Zeng, Yang, Song and Xu (2021) is a multitask learning framework that learns aspect term extraction and sentiment polarity based on local context features and self-attention, taking BERT pre-trained embeddings as inputs.


• Augustyniak et al. (2021) presented an ablation study for neural network models for aspect term extraction, where they also showed that incorporating both word and character embeddings as inputs to a Bi-LSTM-CRF model improves its performance and that simpler models perform almost as well as models with more than double their complexity.
• Akhtar et al. (2020) proposed a model which combines a Bi-LSTM and a CNN network that jointly learns aspect extraction and sentiment classification. The Bi-LSTM predicts the aspect terms in the sentence and the CNN module utilizes the convolutional features of the predicted aspect terms for sentiment classification. Both these modules work jointly and benefit each other.
• Ma, Li, Wu, Xie, and Wang (2019) is a sequence labelling framework, appended with a couple of additional components, namely position-aware attention and gated unit networks, which are used to capture features from the current word and its adjacent words. It applies the learning from the meaning of the whole sentence and the previous label during the decoding process.
• Agerri and Rigau (2019) proposed a perceptron-based algorithm and employed three groups of features: local shallow orthographic features, word shape and N-gram features with their context, and simple clustering features based on unigram matching.
• Luo, Li, Liu, Wang and Unger (2019) proposed a bidirectional dependency tree network that combines the information gained from bottom-up and top-down propagation on the given dependency syntactic tree. A complete framework is then developed to integrate the embedded representations and Bi-LSTM along with CRF to learn both tree-structured and sequential features to solve the aspect term extraction problem.
• Yu, Jiang, and Xia (2018) proposed a multi-task learning framework to implicitly capture the relations between aspect and opinion, by explicitly modelling syntactic constraints among aspect term extraction and opinion term extraction to uncover relationships, which provides an optimal solution over the neural predictions for both tasks. They claimed to be the first to explicitly model syntactic constraints in neural network-based approaches for aspect and opinion terms co-extraction.
• Xue, Zhou, Li, and Wang (2017) is a supervised classification method where aspect extraction is designed as a sequence labelling problem. The architecture comprises a Bi-LSTM network where a word embedding layer transforms the input words to real-valued vectors.

Fig. 11. Precision and Recall of Proposed RAL-Model-2 and complete train model across datasets.

Table 8
Comparison of proposed model with supervised baselines on aspect term extraction reported in F-measure.

| Model | Rest-2014 | Rest-2015 | Rest-2016 | Laptop-2014 |
|---|---|---|---|---|
| Chen et al. (2022) | – | – | **77.81** | 81.60 |
| Kumar et al. (2021a) | **89.96** | – | – | **84.01** |
| Yang et al. (2021) | 89.02 | – | – | 83.82 |
| Augustyniak et al. (2021) | 86.05 | – | – | 81.08 |
| Akhtar et al. (2020) | 83.36 | – | – | 78.57 |
| Ma et al. (2019) | – | – | 75.14 | 80.31 |
| Agerri and Rigau (2019) | 84.11 | **70.90** | 73.51 | – |
| Luo et al. (2019) | 85.31 | 70.83 | 74.49 | 80.57 |
| Yu et al. (2018) | 84.50 | 70.53 | – | 78.69 |
| Xue et al. (2017) | 83.65 | 67.73 | 72.95 | – |
| Proposed RAL Model-2 | 85.63 | 67.52 | 73.61 | 78.67 |

The proposed RAL Model-2, which utilises a maximum of 13% of the training data, is compared with recent supervised baselines trained on the entire training data. The best results across all the four datasets have been marked in bold. An observation of the baselines reported in the years spanning from 2017 to 2022 reveals that only Luo et al. (2019) have trained their model on all the four datasets. The proposed RAL Model-2 has achieved results almost on par with Luo et al. (2019), except for a 1–3 points dip in F-measure across datasets. The same observation holds for baselines which have trained their model on three datasets, like Xue et al. (2017), Agerri and Rigau (2019) and Yu et al. (2018), which have adopted moderately complex architectures. The proposed RAL Model-2 has reported results very close to these baselines and even outperformed them on certain datasets. The models which have reported the highest results have each been explored on only two datasets: Chen et al. (2022), with the highest result on the Restaurant-2016 dataset, and Kumar et al. (2021a), with the highest results on the Restaurant-2014 and Laptop-2014 datasets. Both these baselines have strengthened their models by taking inputs in the form of double embeddings, generic and domain specific, and have employed self-attention architectures, which explains their superior performance. Utilising the most informative instances, which account for barely 9.11% of the training data, the performance dip of the proposed model when compared to the best results reported on the Restaurant-2014 dataset is a marginal difference of 3 points in F-measure. Similarly, the performance dip of the proposed model trained on a minimal sample of the Restaurant-2015 dataset is only around 3 points in F-measure in comparison to the best results reported. The fact that the proposed model has brought down the training data requirement to a bare minimum of 9% to 13% of the total volume and is still able to showcase on par results with recent baselines is quite appreciable. This confirms the strength of the proposed model in sampling the most informative instances that are almost representative of the entire training data.

The comparison with recent baselines on aspect term extraction shows that the proposed model trained on a minimal informative sample is able to achieve almost on par results. The choice of the base model for the experiments has been a moderately simple BiLSTM-CRF


architecture, which establishes the strength of the sampling strategy.

5.4. Fine grained analysis based on attribute aided evaluation

In Section 5.2, the proposed RAL model was assessed and compared with other models based on its reported F-measure on the complete test data. A fine grained analysis based on attributes like Aspect Term Length, Out of Vocabulary words and Aspect Term Label Consistency, which are likely to be challenging factors for a model with respect to aspect term extraction, would give insights into which among these attributes actually influence the model and how the proposed RAL model performs in comparison to the complete train model with respect to these attributes. The idea of dividing the test beds into partitions based on attributes that could challenge the model was suggested by Fu and team (Fu, Liu, & Neubig, 2020). Drawing inspiration from their work, three partitions/buckets have been taken from each test set, partitioned based on Aspect Term Length, Aspect Term Label Consistency and Out of Vocabulary words, and the proposed RAL Model-2 and the complete train data model have been evaluated on these buckets separately.

5.4.1. Fine grained analysis based on aspect term length

It is often noticed that aspect terms containing more words are more challenging for prediction and hence Aspect Term Length (ATL) is chosen as one of the features for this fine grained analysis. Review instances which contain only aspect terms of length one, two, and three or more words are segregated into three buckets. The number of instances in the third bucket was quite low in comparison to the first two buckets across the datasets experimented.

The results of attribute based evaluations of the proposed RAL Model-2 and complete train data for the attribute Aspect Term Length in terms of F-measure are presented in Table 9.

Table 9
Attribute based evaluation based on Aspect Term Length.

| Dataset | Model | Bucket1 (ATL = 1) | Bucket2 (ATL = 2) | Bucket3 (ATL >= 3) |
|---|---|---|---|---|
| Rest-2014 | Complete Train | 95.76 | 91.08 | 88.29 |
| | RAL Model-2 | 93.32 | 88.61 | 87.64 |
| | Difference | 2.44 | 2.47 | 0.65 |
| Rest-2015 | Complete Train | 90.22 | 87.09 | 86.76 |
| | RAL Model-2 | 85.99 | 81.05 | 84.86 |
| | Difference | 4.23 | 6.04 | 1.9 |
| Rest-2016 | Complete Train | 87.4 | 82.14 | 80.68 |
| | RAL Model-2 | 84.57 | 79.69 | 80.31 |
| | Difference | 2.83 | 2.45 | 0.37 |
| Laptop-2014 | Complete Train | 82.72 | 86.16 | 87.53 |
| | RAL Model-2 | 75.94 | 89.60 | 87.79 |
| | Difference | 6.78 | −3.44 | −0.26 |

As observed in Table 9, model performance is found to have a negative correlation with Aspect Term Length across all the datasets except the Laptop-2014 dataset. The models have reported lower performance in buckets corresponding to two-word and multi-word aspects (Bucket2 and Bucket3) in the majority of scenarios. An investigation of the test instances in the laptop domain revealed that there is a higher proportion of unique single-word aspects in the Laptop-2014 dataset in comparison to other datasets, making it more challenging. The weakness of the pre-trained embeddings utilized as inputs in representing the laptop domain, which is a relatively distinct domain, could be another reason for the comparatively poor performance on laptop single-term aspects. The difference in performance between the proposed RAL Model-2 and the complete train model across corresponding buckets is comparatively larger for the single-word and two-word buckets than for the multi-word bucket, as observed in Table 9. The magnitude of these differences is highest in the Rest-2015 dataset, with almost 4 points and 6 points in bucket1 and bucket2, and this dip is also observed in the overall F-measure comparison depicted in Fig. 9. Rest-2015 is the most challenging dataset in SemEval, on which the lowest results have been reported by all the baselines, as observed in Table 8. The single-term aspect bucket of Laptop-2014 is another scenario where the difference is on the higher side. The larger proportion of unique single-term aspects in the laptop domain could have posed a challenge to the proposed RAL Model-2, which is trained on a minimal sample. An interesting result has been reported on bucket2 and bucket3 of Laptop-2014, where the RAL Model-2 has outperformed the complete train model.

5.4.2. Fine grained analysis based on aspect term label consistency

A term labelled as an aspect in one review instance need not be marked so in another instance if it is not opinionated there. This inconsistency might influence model training. For all the aspects present in the training data, a measure called Aspect Term Label Consistency (ATLC) is derived as the ratio of the number of labelled occurrences of the aspect to its total number of occurrences in the train data, which is a value in the range (0, 1]. Based on the presence of these aspect terms and their consistency measure, three buckets are prepared from the test data containing aspect terms with ATLC in three different ranges, (0, 0.4], (0.4, 0.75] and (0.75, 1] respectively. The intervals are chosen so as to ensure a uniform distribution of instances across buckets to the extent possible.

Table 10
Attribute based evaluation based on Entity Label Consistency.

| Dataset | Model | Bucket1 (ATLC interval: (0, 0.4]) | Bucket2 (ATLC interval: (0.4, 0.75]) | Bucket3 (ATLC interval: (0.75, 1]) |
|---|---|---|---|---|
| Rest-2014 | Complete Train | 88.67 | 91.14 | 94.82 |
| | RAL Model-2 | 81.27 | 87.83 | 93.76 |
| | Difference | 7.4 | 3.31 | 1.06 |
| Rest-2015 | Complete Train | 87.32 | 89.65 | 92.21 |
| | RAL Model-2 | 85.28 | 87.27 | 90.86 |
| | Difference | 2.04 | 2.38 | 1.35 |
| Rest-2016 | Complete Train | 73.21 | 75.77 | 86.00 |
| | RAL Model-2 | 69.18 | 73.76 | 87.08 |
| | Difference | 4.03 | 2.01 | −1.08 |
| Laptop-2014 | Complete Train | 83.50 | 84.30 | 87.24 |
| | RAL Model-2 | 81.74 | 82.43 | 86.48 |
| | Difference | 1.76 | 1.87 | 0.76 |

The results of attribute based evaluations of the proposed RAL Model-2 and complete train data for the attribute Aspect Term Label Consistency are showcased in Table 10. Model performance is found to have a significant positive correlation with ATLC, i.e. the models exhibited better performance on buckets which contain aspect terms with higher labelling consistency values. The results reported on bucket3, which contains the test instances with the most consistent aspect labels, are comparatively much better, and this is observed consistently across datasets. The performance dip of the proposed RAL model in comparison to the complete train model is larger in the low-consistency buckets, bucket1 and bucket2, as observed in Table 10. This observation suggests that the proposed RAL Model-2, which is trained on a minimal sample, is constrained by inconsistent aspect term labels. The performance dip between the proposed RAL Model-2 and the complete train model is comparatively low in the Laptop-2014 dataset in comparison to other datasets. An investigation of the buckets revealed that the number of test instances in the lower consistency buckets was the lowest for Laptop-2014, indicating minimum presence of inconsistent aspect term labels in comparison to other datasets and hence imposing fewer challenges on RAL Model-2. An interesting result is reported in bucket3 of the Restaurant-2016 dataset, where the proposed RAL model outperforms the complete train model by a


point in F-measure.

5.4.3. Fine grained analysis based on out of vocabulary words

Test instances that contain words not present in the train vocabulary are expected to pose challenges to a machine learning model. However, neural networks that take inputs from word embeddings are claimed to be comparatively less influenced by the presence of out of vocabulary words. To investigate the claim and find whether this attribute influences the performance of the model, the proposed RAL model and the complete train model are tested separately on three buckets which are categorized by the proportion of Out of Vocabulary (OOV) words that they contain, calculated as the ratio of out of vocabulary words in the test instance to the total out of vocabulary words in the entire test data. The three buckets contain test instances that have the OOV ratio in the ranges (0, 0.00125], (0.00125, 0.00375] and (0.00375, 0.01125] respectively, ensuring a uniform distribution across buckets to the extent possible, even though the third bucket has comparatively fewer instances.

The results of attribute based evaluations of the proposed RAL Model-2 and complete train data for the OOV based analysis are showcased in Table 11. As per the information in Table 11, the out of vocabulary words (OOV) attribute is found to have a negative correlation with the model performance. The model performance is negatively influenced by an increasing presence of out of vocabulary words in the test instances, even though the influence is minimal. The difference in performance between the complete train model and the proposed RAL model across corresponding buckets is larger for bucket2 and bucket3, which correspond to higher OOV ratios. This observation holds for all datasets except Rest-2016, which has higher performance dip values in the buckets with low OOV ratio. This behaviour could have been influenced by other challenges embedded in the instances contained in those buckets.

Table 11
Attribute based evaluation based on Out of Vocabulary Words.

| Dataset | Model | Bucket1 (OOV range: (0, 0.00125]) | Bucket2 (OOV range: (0.00125, 0.00375]) | Bucket3 (OOV range: (0.00375, 0.01125]) |
|---|---|---|---|---|
| Rest-2014 | Complete Train | 90.32 | 88.99 | 87.83 |
| | RAL Model-2 | 88.77 | 86.63 | 83.71 |
| | Difference | 1.55 | 2.36 | 4.12 |
| Rest-2015 | Complete Train | 90.09 | 89.87 | 89.31 |
| | RAL Model-2 | 88.33 | 87.20 | 86.66 |
| | Difference | 1.76 | 2.67 | 2.65 |
| Rest-2016 | Complete Train | 77.56 | 71.25 | 66.86 |
| | RAL Model-2 | 73.04 | 67.70 | 68.22 |
| | Difference | 4.52 | 3.55 | −1.36 |
| Laptop-2014 | Complete Train | 83.87 | 80.64 | 77.65 |
| | RAL Model-2 | 82.58 | 75.96 | 73.28 |
| | Difference | 1.29 | 4.68 | 4.37 |

The observations from this fine grained attribute based evaluation indicate that aspect term length and entity label consistency have significant influence on the performance of both the reinforced active learning model and the complete train model. The out of vocabulary words attribute has a comparatively lesser influence, which supports the theory that deep learning models which take inputs in the form of word embeddings are less challenged by out of vocabulary words. The fact that the behavioural patterns of both the models are similar towards these attributes suggests that the optimal sample chosen by the reinforced active learning model mimics the behaviour of the complete train data. The challenge of the proposed RAL model in learning from a minimal sample seems to be most evident in the Entity Label Consistency attribute, as it was observed that the performance gap between the two models was consistently high across the low entity label consistency buckets. Similarly, even though not as consistently, the proposed RAL model reported larger performance gaps on buckets with a larger proportion of out of vocabulary words. Surprisingly, for the attribute aspect term length, the buckets corresponding to smaller entity lengths have been more challenging for the proposed RAL model, as indicated by the larger performance dips. This could be attributed to the minimal optimal sample capturing aspect terms with more words semantically better than aspects with fewer words.

6. Conclusion

A reinforced active learning model for the task of aspect term extraction in sentiment analysis has been proposed, which is capable of extracting, in an unsupervised manner, a subset from the entire training data that is almost representative of the entire sample. The model has been experimented on SemEval datasets and has reported an appreciable performance both in terms of data reduction and performance metrics. The proposed RAL model outperforms random sampling by 9 to 17 points in F-measure reported on aspect term extraction across datasets. Only 9% to 13% of the entire train data is utilized by the proposed model, which is a highly appreciable cost reduction. The proposed RAL model is able to achieve appreciable results in comparison to full supervision, which attests to the quality of the optimal sample extracted by the proposed model.

In a real world scenario, the unlabelled instances need not be available all at once, even though a large volume is available at the time of initial model building. A reinforcement learning algorithm is expected to avoid the retraining required by typical supervised models, as it adapts to new environments automatically. Any minimal add-ons to the proposed reinforced active learning model to ensure that this claim holds can be a future direction. A patent application projecting a generic version of the proposed methodology suitable for any text labelling application has been filed.⁴ There is always a tradeoff between the performance metric expected and the annotation costs. Low resource languages would benefit if they were provided with a confidence score corresponding to the tradeoff between performance and annotation costs. Incorporating this parameter in the work could be another direction of exploration. The model can be experimented on other languages too and would prove beneficial for low resource languages. The challenge would be the availability of good-quality pre-trained word embedding resources for these languages. The generalizability of modern ML models is an inspiring research challenge in the upcoming era, where we expect the model learned on a specific task to be capable of transfer learning for similar tasks. The possibility of generalizing the proposed model for any sequential text labelling application could be another future direction.

⁴ Deepa Gupta, Manju Venugopalan, Peeta Basa Pati, “An Automated System for Identifying an Optimal Set for Text Labelling”, 202141044434, 30th Sep 2021, Indian Provisional Patent application.

CRediT authorship contribution statement

Manju Venugopalan: Conceptualization, Methodology, Visualization, Investigation, Data curation, Software, Writing – original draft. Deepa Gupta: Conceptualization, Supervision, Validation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial


interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Agerri, R., & Rigau, G. (2019). Language independent sequence labelling for opinion target extraction. Artificial Intelligence, 268, 85–95.
Akhtar, M. S., Garg, T., & Ekbal, A. (2020). Multi-task learning for aspect term extraction and aspect sentiment classification. Neurocomputing, 398, 247–256.
Alamoodi, A. H., Zaidan, B. B., Zaidan, A. A., Albahri, O. S., Mohammed, K. I., Malik, R. Q., … Alaa, M. (2021). Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review. Expert Systems with Applications, 167, Article 114155.
Alamoudi, E. S., & Alghamdi, N. S. (2021). Sentiment classification and aspect-based sentiment analysis on yelp reviews using deep learning and word embeddings. Journal of Decision Systems, 30(2–3), 259–281.
Angluin, D. (2001). Queries revisited. In International Conference on Algorithmic Learning Theory (pp. 12–31). Berlin, Heidelberg: Springer.
Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., & Agarwal, A. (2020). Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. Proceedings of the 8th International Conference on Learning Representations.
Augustyniak, L., Kajdanowicz, T., & Kazienko, P. (2021). Comprehensive analysis of aspect term extraction methods using various text embeddings. Computer Speech & Language, 69, Article 101217.
Balcan, M. F., Hanneke, S., & Vaughan, J. W. (2010). The true sample complexity of active learning. Machine Learning, 80(2), 111–139.
Basiri, M. E., Nemati, S., Abdar, M., Cambria, E., & Acharya, U. R. (2021). ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Generation Computer Systems, 115, 279–294.
Baxter, J., Tridgell, A., & Weaver, L. (2001). Machines that learn to play games, chapter Reinforcement learning and chess. Nova Science Publishers, 91–116.
Beluch, W. H., Genewein, T., Nurnberger, A., & Kohler, J. M. (2018). The power of ensembles for active learning in image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9368–9377).
Beygelzimer, A., Dasgupta, S., & Langford, J. (2009). Importance-weighted active learning (pp. 49–56). ACM Press.
Bouguelia, M. R., Belaid, Y., & Belaid, A. (2016). An adaptive streaming active learning strategy based on instance weighting. Pattern Recognition Letters, 70, 38–44.
Carbonneau, M. A., Cheplygina, V., Granger, E., & Gagnon, G. (2018). Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77, 329–353.
Chaudhary, A., Anastasopoulos, A., Sheikh, Z., & Neubig, G. (2021). Reducing Confusion in Active Learning for Part-Of-Speech Tagging. Transactions of the Association for Computational Linguistics, 9, 1–16.
Chen, C., Wang, H., Zhu, Q., & Liu, J. (2022). Densely-connected neural networks for aspect term extraction. Science China Information Sciences, 65(6), 1–3.
Cing, D. L., & Soe, K. M. (2020). Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language. International Journal of Electrical and Computer Engineering, 10(2), 2023.
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine learning, 15(2), 201–221.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (pp. 4171–4186). ACM.
Fang, M., Li, Y., & Cohn, T. (2017). Learning how to active learn: A deep reinforcement learning approach. In Proceedings of EMNLP (pp. 595–605). ACM.
Fu, J., Liu, P., & Neubig, G. (2020). Interpretable multi-dataset evaluation for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (pp. 6058–6069). ACM.
Gandhi, H., & Attar, V. (2020). Extracting aspect terms using CRF and bi-LSTM models. Procedia Computer Science, 167, 2486–2495.
Gilad-Bachrach, R., Navot, A., & Tishby, N. (2005). Query by committee made real. Advances in neural information processing systems, 18, 443–450.
Guo, J., Pang, Z., Bai, M., Xie, P., & Chen, Y. (2021). Dual generative adversarial active learning. Applied Intelligence, 51(8), 5953–5964.
Ishibashi, H., & Hino, H. (2020). Stopping criterion for active learning based on deterministic generalization bounds. In International Conference on Artificial Intelligence and Statistics (pp. 386–397). PMLR.
Jagarlamudi, J., Daume, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 204–213).
Joshi, A. J., Porikli, F., & Papanikolopoulos, N. (2009). Multi-class active learning for image classification (pp. 2372–2379). IEEE.
Keneshloo, Y., Ramakrishnan, N., & Reddy, C. K. (2019). Deep transfer reinforcement learning for text summarization (pp. 675–683). Society for Industrial and Applied Mathematics.
Kiani, J., Camp, C., Pezeshk, S., & Khoshnevis, N. (2020). Application of pool-based active learning in reducing the number of required response history analyses. Computers & Structures, 241, Article 106355.
Kim, D. K., Lee, B., Kim, D., & Jeong, H. (2020). Multi-Label Classification of Historical Documents by Using Hierarchical Attention Networks. Journal of the Korean Physical Society, 76(5), 368–377.
King, R. D., Whelan, K. E., Jones, F. M., Reiser, P. G., Bryant, C. H., Muggleton, S. H., … Oliver, S. G. (2004). Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971), 247–252.
Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab, A. A., Yogamani, S., & Perez, P. (2021). Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 1–18.
Koncz, P., & Paralic, J. (2013). Active learning enhanced document annotation for sentiment analysis. In International Conference on Availability, Reliability, and Security (pp. 345–353). Berlin, Heidelberg: Springer.
Kumar, A., Veerubhotla, A. S., Narapareddy, V. T., Aruru, V., Neti, L. B. M., & Malapati, A. (2021a). Aspect term extraction for opinion mining using a Hierarchical Self-Attention Network. Neurocomputing, 465, 195–204.
Kumar, A., Verma, S., & Sharan, A. (2021b). ATE-SPD: Simultaneous extraction of aspect-term and aspect sentiment polarity using Bi-LSTM-CRF neural network. Journal of Experimental & Theoretical Artificial Intelligence, 33(3), 487–508.
Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From Word Embeddings to Document Distances. In Proceedings of the 32nd International Conference on Machine Learning (pp. 957–966). PMLR.
Lakshmi Devi, K., Subathra, P., & Kumar, P. N. (2016). Performance evaluation of sentiment classification using query strategies in a pool based active learning scenario (pp. 65–75). Singapore: Springer.
Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning (pp. 148–156). Morgan Kaufmann.
Li, J., Sun, A., & Ma, Y. (2020). Neural named entity boundary detection. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1790–1795.
Li, M., & Sethi, I. K. (2006). Confidence-based active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8), 1251–1261.
Li, Y., Wang, X., Liu, H., Pu, L., Tang, S., Wang, G., & Liu, X. (2021). Reinforcement Learning based Resource Partitioning for Improving Responsiveness in Cloud Gaming. IEEE Transactions on Computers, 14(8), 1049–1062.
Li, Y., Fan, B., Zhang, W., Ding, W., & Yin, J. (2021). Deep active learning for object detection. Information Sciences, 579, 418–433.
Lin, C. H., Mausam, M., & Weld, D. S. (2016). Re-active learning: Active learning with relabelling. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence (pp. 1845–1852).
Liu, Z., Jiang, X., Luo, H., Fang, W., Liu, J., & Wu, D. (2021). Pool-based unsupervised active learning for regression using iterative representativeness-diversity maximization (iRDM). Pattern Recognition Letters, 142, 11–19.
Lughofer, E. (2012). Hybrid active learning for reducing the annotation effort of operators in classification systems. Pattern Recognition, 45(2), 884–896.
Luo, H., Li, T., Liu, B., Wang, B., & Unger, H. (2019). Improving aspect term extraction with bidirectional dependency tree representation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(7), 1201–1212.
Lykouris, T., Simchowitz, M., Slivkins, A., & Sun, W. (2021). Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory (pp. 3242–3245). PMLR.
Ma, D., Li, S., Wu, F., Xie, X., & Wang, H. (2019). Exploring sequence-to-sequence learning in aspect term extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3538–3547).
Guo, Y., & Schuurmans, D. (2008). In Efficient global optimization for exponential family Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and
PCA and low-rank matrix factorization (pp. 1100–1107). IEEE. applications: A survey. Ain Shams engineering journal, 5(4), 1093–1113.
Hadano, M., Shimada, K., & Endo, T. (2011). Aspect identification of sentiment sentences Miao, Y. L., Cheng, W. F., Ji, Y. C., Zhang, S., & Kong, Y. L. (2021). Aspect-based
using a clustering algorithm. Procedia-Social and Behavioral Sciences, 27, 22–31. sentiment analysis in Chinese based on mobile reviews for BiLSTM-CRF. Journal of
Hajmohammadi, M. S., Ibrahim, R., Selamat, A., & Fujita, H. (2015). Combination of Intelligent & Fuzzy Systems (Preprint), 1–11.
active learning and self-training for cross-lingual sentiment classification with Mir, J., & Mahmood, A. (2020). Movie Aspects Identification Model for Aspect Based
density analysis of unlabelled samples. Information sciences, 317, 67–77. Sentiment Analysis. Information Technology and Control, 49(4), 564–582.
Hantke, S., Zhang, Z., & Schuller, B. W. (2017). Towards Intelligent Crowdsourcing for Moschitti, A. (2006). Making tree kernels practical for natural language learning. In 11th
Audio Data Annotation: Integrating Active Learning in the Real World. In conference of the European Chapter of the Association for Computational Linguistics (pp.
INTERSPEECH (pp. 3951-3955). 113–120).
Hemmer, P., Kuhl, N., & Schoffer, J. (2022). Deal: deep evidential active learning for Nair, P. C., Gupta, D., Devi, B. I., & Bhat, N. R. (2019). Automated clinical concept-value
image classification. In Deep Learning Applications, Volume 3 (pp. 171-192). pair extraction from discharge summary of pituitary adenoma patients. In 2019 9th
Springer, Singapore. International Conference on Advances in Computing and Communication (ICACC)
Hoi, S. C., Jin, R., & Lyu, M. R. (2006). In Large-scale text categorization by batch mode (pp. 258-264). IEEE.
active learning (pp. 633–642). ACM Press. Obiedat, R., Al-Darras, D., Alzaghoul, E., & Harfoushi, O. (2021). Arabic Aspect-Based
Hu, R., Namee, B. M., & Delany, S. J. (2016). Active learning for text classification with Sentiment Analysis: A Systematic Literature Review. IEEE Access, 9, 152628–152645.
reusability. Expert Systems with Applications, 45, 438–449. Pal, S., Gupta, Y., Shukla, A., Kanade, A., Shevade, S., & Ganapathy, V. (2020).
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence ACTIVETHIEF: Model Extraction Using Active Learning and Unannotated Public
tagging. arXiv preprint arXiv:1508.01991. Data. In Proceedings of the AAAI Conference on Artificial Intelligence, 34(1), 865–872.

Pan, X., Wei, D., Zhao, Y., Ma, M., & Wang, H. (2020). Self-Paced Learning with Diversity Wang, Y., Won, K. S., Hsu, D., & Lee, W. S. (2012). Monte carlo bayesian reinforcement
for Medical Image Segmentation by Using the Query-by-Committee and Dynamic learning. International conference on Machine Learning.
Clustering Techniques. IEEE Access, 9, 9834–9844. Wei, Q., Chen, Y., Salimi, M., Denny, J. C., Mei, Q., Lasko, T. A., … Xu, H. (2019). Cost-
Park, S., Lee, W., & Moon, I. C. (2015). Efficient extraction of domain specific sentiment aware active learning for named entity recognition in clinical text. Journal of the
lexicon with active learning. Pattern Recognition Letters, 56, 38–44. American Medical Informatics Association, 26(11), 1314–1322.
Ren, P., Xiao, Y., Chang, X., Huang, P. Y., Li, Z., Gupta, B. B., Chen, X., & Wang, X. Wu, Z., & Xiao, L. (2019). A structure with density-weighted active learning-based model
(2021). A survey of deep active learning. ACM Computing Surveys, 4(9), 1–40. selection strategy and meteorological analysis for wind speed vector deterministic
Schein, A. I., & Ungar, L. H. (2007). Active learning for logistic regression: An evaluation. and probabilistic forecasting. Energy, 183, 1178–1194.
Machine Learning, 68(3), 235–265. Wu, X., Chen, C., Zhong, M., & Wang, J. (2021). HAL: Hybrid active learning for efficient
Settles, B., Craven, M., & Ray, S. (2007). Multiple-instance active learning. In Advances in labeling in medical domain. Neurocomputing, 563–572.
Neural Information Processing Systems (NIPS), 20, 1289–1296. Wu, X., Chen, C., Zhong, M., Wang, J., & Shi, J. (2021). COVID-AL: The diagnosis of
Settles, B., & Craven, M. (2008). In An analysis of active learning strategies for sequence COVID-19 with deep active learning. Medical Image Analysis, 68, Article 101913.
labeling tasks (pp. 1070–1079). ACL Press. Xiao, X., Wang, F., Shahidehpour, M., Li, Z., & Yan, M. (2020). Coordination of
Shen, Y., Yun, H., Lipton, Z. C., Kronrod, Y., & Anandkumar, A. (2018). Deep active distribution network reinforcement and DER planning in competitive market. IEEE
learning for named entity recognition. In International Conference on Learning Transactions on Smart Grid, 12(3), 2261–2271.
Representations (pp. 252–256). Xiaolong, J. W. G. Y. W. (2006). Conditional random fields based pos tagging. Computer
Shim, H., Lowet, D., Luca, S., & Vanrumste, B. (2021). LETS: A Label-Efficient Training Engineering and. Applications.
Scheme for Aspect-Based Sentiment Analysis by Using a Pre-Trained Language Xue, W., Zhou, W., Li, T., & Wang, Q. (2017). MTNA: A neural multi-task model for
Model. IEEE Access, 9, 115563–115578. aspect category classification and aspect term extraction on restaurant reviews. In
Shui, C., Zhou, F., Gagné, C., & Wang, B. (2020). Deep Active Learning: Unified and Proceedings of the Eighth International Joint Conference on Natural Language Processing
Principled Method for Query and Training. In International Conference on Artificial (Volume 2: Short Papers) (pp. 151–156).
Intelligence and Statistics for query and training. In International Conference on Artificial Yang, H., Zeng, B., Yang, J., Song, Y., & Xu, R. (2021). A multi-task learning model for
Intelligence and Statistics (pp. 1308–1318). PMLR. chinese-oriented aspect polarity classification and aspect term extraction.
Shyam Sundar, K., & Gupta, D. (2020). In Active Learning Enhanced Sequence Labeling for Neurocomputing, 419, 344–356.
Aspect Term Extraction in Review Data (pp. 349–361). Singapore: Springer. Yu, H. (2005). SVM selective sampling for ranking with application to data retrieval.
Smailovic, J., Grcar, M., Lavrac, N., & Znidarsic, M. (2014). Stream-based active learning In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge
for sentiment analysis in the financial domain. Information sciences, 285, 181–203. discovery in data mining (pp. 354-363). ACM Press.
Smatana, M., Koncz, P., Smatana, P., & Paralic, J. (2013). In Active learning enhanced Yu, J., Jiang, J., & Xia, R. (2018). Global inference for aspect and opinion terms co-
semi-automatic annotation tool for aspect-based sentiment analysis (pp. 191–194). IEEE. extraction based on multi-task neural net-works. IEEE/ACM Transactions on Audio,
Sutton, R. S., & Barto, A. G. (1998). Introduction to reinforcement learning. (Vol. 135). Speech, and Language Processing, 27(1), 168–177.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press. Zhang, D., Xia, C., Xu, C., Jia, Q., Yang, S., Luo, X., & Xie, Y. (2020). Improving Distantly-
Tandra, S., Nautiyal, A., & Gupta, D. (2020). An efficient text labeling framework using Supervised Named Entity Recognition for Traditional Chinese Medicine Text via a
active learning model. In Intelligent Systems, Technologies and Applications (pp. Novel Back-Labeling Approach. IEEE Access, 8, 145413–145421.
141–155). Singapore: Springer. Zhang, H. T., Huang, M. L., & Zhu, X. Y. (2012). A unified active learning framework for
Vandoni, J., Aldea, E., & Le Hegarat-Mascle, S. (2019). Evidential query-by-committee biomedical relation extraction. Journal of Computer Science and Technology, 27(6),
active learning for pedestrian detection in high-density crowds. International Journal 1302–1313.
of Approximate Reasoning, 104, 166–184. Zhang, K., Xie, Y., Yang, Y., Sun, A., Liu, H., & Choudhary, A. (2014). Incorporating
Venugopalan, M., & Gupta, D. (2015). Exploring sentiment analysis on twitter data. In conditional random fields and active learning to improve sentiment identification.
Proceedings of the eighth international conference on contemporary computing (IC3) (pp. Neural Networks, 58, 60–67.
241–247). IEEE. Zhang, K., Li, Y., Wang, J., Cambria, E., & Li, X. (2021). Real-Time Video Emotion
Venugopalan, M., Gupta, D., & Bhatia, V. (2021). A Supervised Approach to Aspect Term Recognition based on Reinforcement Learning and Domain Knowledge. IEEE
Extraction Using Minimal Robust Features for Sentiment Analysis. In Progress in Transactions on Circuits and Systems for Video Technology.
Advanced Computing and Intelligent Engineering (pp. 237–251). Singapore: Springer. Zhu, J., Wang, H., Tsou, B. K., & Ma, M. (2009). Active learning with sampling by
Venugopalan, M., & Gupta, D. (2022). An enhanced guided LDA model augmented with uncertainty and density for data annotations. IEEE Transactions on audio, speech, and
BERT based semantic strength for aspect term extraction in sentiment analysis. language processing, 18(6), 1323–1331.
Knowledge-Based Systems, 246, Article 108668. Zhu, J., Wang, H., & Hovy, E. (2008). Multi-criteria-based strategy to stop active learning
Vlachos, A. (2008). A stopping criterion for active learning. Computer Speech & Language, for data annotation. In Proceedings of the 22nd International Conference on
22(3), 295–312. Computational Linguistics-Volume 1 (pp. 1129–1136).
Wang, D., & Fan, X. (2009). Named entity recognition for short text. Journal of Computer Zhu, J., Wang, H., Hovy, E., & Ma, M. (2010). Confidence-based stopping criteria for
Applications, 29(1), 143–145. active learning for data annotation. ACM Transactions on Speech and Language
Wang, X., Wan, L., & Zhang, J. (2019). An Active Learning Framework Based on Query- Processing (TSLP), 6(3), 1–24.
By-Committee for Sentiment Analysis. In 2019 IEEE International Conference on Zou, W., Huang, S., Xie, J., Dai, X., & Chen, J. (2019). A reinforced generation of
Artificial Intelligence and Computer Applications (ICAICA) (pp. 327–331). IEEE. adversarial examples for neural machine translation. In Proceedings of the 58th
Wang, L., Hu, X., Yuan, B., & Lu, J. (2015). Active learning via query synthesis and Annual Meeting of the Association for Computational Linguistics (pp. 3486–3497).
nearest neighbour search. Neurocomputing, 147, 426–434.
