
Applied Intelligence

https://doi.org/10.1007/s10489-020-02144-x

Interpretable semantic textual similarity of sentences using alignment of chunks with classification and regression

Goutam Majumder1 · Partha Pakray2 · Ranjita Das3 · David Pinto4

Accepted: 12 December 2020


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021

Abstract
The proposed work focuses on establishing an interpretable Semantic Textual Similarity (iSTS) method for a pair of
sentences, which can clarify why two sentences are completely or partially similar or have some variations. The proposed
interpretable approach is a pipeline of five modules that begins with the pre-processing and chunking of text. Chunks
of the two sentences are then aligned using a one-to-multi (1:M) chunk aligner. Thereafter, Support Vector, Gaussian Naive Bayes and
k-Nearest Neighbours classifiers are used to build a multiclass classification algorithm, and the different class labels are
used to define an alignment type. Finally, a multivariate regression algorithm is developed to find the semantic equivalence
of an alignment with a score (ranging from 0 to 5). The efficiency of the proposed method is verified on three different
datasets and compared to other state-of-the-art interpretable STS (iSTS) methods. The evaluation results show that the
proposed method performs better than other iSTS methods. Most importantly, the modules of the proposed iSTS method are
used to develop a Textual Entailment (TE) method. It is found that, when chunk level, alignment and sentence level
features are combined, the entailment results improve significantly.

Keywords Semantic textual similarity · Natural language understanding · Text classification · Multivariate regression

Goutam Majumder
goutam.nita@gmail.com

Partha Pakray
parthapakray@gmail.com

Ranjita Das
ranjita.nitm@gmail.com

David Pinto
dpinto@cs.buap.mx

1 Lovely Professional University, Phagwara, Punjab, India
2 National Institute of Technology Silchar, Silchar, Assam, India
3 National Institute of Technology Mizoram, Aizawl, Mizoram, India
4 Benemérita Universidad Autónoma de Puebla, Puebla, Mexico

1 Introduction

Semantic Textual Similarity (STS) is a measure of semantic equivalence between a pair of text segments. In Natural Language Processing (NLP) it is defined by a metric of semantic relations (such as synonymy, antonymy, entailment, paraphrase, etc.) between a pair of documents or text segments [25]. Developing an effective STS method plays a significant role in various NLP applications, such as text summarization [45], word sense disambiguation [32, 47] and document classification [53]. It also contributes to the field of Information Retrieval (IR), where a similarity score between a set of documents and image captions is used to retrieve similar documents and images [15].

Identification of STS in short texts was proposed in 2006 in the works reported in [33, 38]. After that, the focus shifted to large documents or individual words. Since 2012, the task of semantic similarity has not been limited to finding the similarity between two texts but has also included generating a similarity score from 0 to 5 in the SemEval tasks1 [2–4], where a score of 0 means unrelated and 5 means complete semantic equivalence. Since its inception, the problem has seen a large number of solutions in a relatively small amount of time. The central idea behind most of the solutions is the identification and alignment of semantically similar or related words across the two sentences and the aggregation of these similarities to generate an overall similarity [24, 38, 52].

The proposed method not only measures the semantic equivalence but also adds an interpretability layer on top of the similarity score for a pair of text segments. This layer is designed by finding the similarities and differences between the segments (chunks) of two sentences. In the literature, adding an interpretability layer to a similarity score has been identified as interpretable STS (iSTS) [5, 6, 34]. Next, we define an iSTS method by considering a pair of sentences from the image caption dataset:2

– A cat standing on tree branches.
– A black and white cat is high up on tree branches.

The goal of this proposed work is to measure a similarity score, as well as to explain the similarities and differences between sentences. The output of an iSTS method would be something like the following:

The above image captions describe information about the content of an image. "The first sentence talks about a cat that is standing on tree branches, and the second caption gives the same information in a more specific way. The second caption provides more information about the cat as a black and white cat. Further, it also tells that the cat is way above on the tree."

So, giving this kind of detailed information is natural for a human, but designing an automated algorithm that can produce a human-like performance can be considered a Natural Language Understanding (NLU) problem and is useful in applications such as a dialogue system or an Intelligent Tutoring System (ITS).

In this paper, a method is developed to produce an interpretability layer for a sentence pair. The proposed method is based on the alignment of chunks of two sentences and provides a relation type with a similarity score for each alignment. The proposed method is trained and tested over three datasets of annotated alignments of sentence pairs. It can provide the similarities and dissimilarities in the form of chunk alignments, which is depicted in Fig. 1.

Figure 1 shows a representation of an interpretability layer for a pair of sentences, which includes chunks of each sentence with alignments, relation type, and a similarity score. Two sentences are divided into small segments (called chunks); these are aligned as follows: "A black and white cat" is more specific (SPE2) than "A cat" with a score of 4, "standing" is equivalent to "is" with a score of 5 (EQUI), and "on tree branches" of the first sentence is equivalent to "on tree branches" of the second sentence with a score of 5 (EQUI). The chunk "high up" is left unaligned (NOALI) because no such chunk is available in the first sentence for alignment, and a score of 0 (zero) is assigned to the chunk. The similarity and relatedness score can be anything from 0 (the corresponding chunk is not present for alignment) to 5 (semantically similar and shares the same information).

Model interpretability is widely used in NLP, and an Interpretable Vector Space Model (VSM) is a kind of model that allows deeper exploration of semantic compositions [19]. Ritter et al. [44] prove that they can infer categories that humans can easily understand, and Fyshe et al. suggest that the dimensions of their word representations correspond to easily interpretable definitions [19]. Explanations are also important in the teaching domain, where Intelligent Tutoring Systems (ITSs) strive to provide feedback beyond correct/incorrect judgments. In most cases, the systems rely on expensive domain-dependent and question-dependent knowledge [7, 27], but some scalable alternatives based on generic NLP techniques are also available [40].

The organisation of the paper is as follows: the novelty of the paper is reported in Section 2. The related work about interpretable STS is reported in Section 3. Further, the proposed methodology with its various components is discussed in Section 4. Section 5 highlights the detailed evaluation and performance of the various iSTS components, and detailed information about the datasets with various statistics is reported in Section 5.1. A comparative analysis with other state-of-the-art methods is discussed in Section 6. The usability of the proposed iSTS modules is tested with an NLP task (Textual Entailment recognition) and reported in Section 7, and the conclusions of the proposed work are reported in Section 8.

1 http://ixa2.si.ehu.es/stswiki/index.php/Main_Page
2 https://alt.qcri.org/semeval2015/task2/index.php?id=data-and-tools

2 Major findings

The important contributions of the proposed interpretable semantic textual similarity for finding the similarities and differences between the sentences are summarized below:

– In the sense of STS, it formalizes the interpretability layer as a weighted and typed alignment among segments in the two sentences.
– We have developed four independent components: i) chunking (reported in Section 4.1); ii) chunk alignment (reported in Section 4.2); iii) measuring the alignment score, called scoring (reported in Section 4.3); and iv) assigning an alignment type, called classification (reported in Section 4.4), to address interpretable STS (iSTS).
– For assigning an alignment type and measuring the alignment score, we have developed two supervised models using three types of feature values. We have
incorporated string similarities with phrase embeddings and lexical features to train the supervised models.
– We have shown why a classifier-based chunker identifies chunks better than a rule-based chunker. When we compared the proposed iSTS method with other state-of-the-art methods, it was found that the others used either a ready-to-use chunker or gold standard chunks for experimental purposes.
– The proposed method is trained and tested with three different annotated datasets of news headlines, image captions and student answers. The reported system results are well above the state-of-the-art methods.
– In order to judge the usefulness of the proposed iSTS method, we have formalized a Textual Entailment (TE) method (discussed in Section 7) based on the features extracted at chunk level (CL) and sentence level (SL) of the entailing text (T) and hypothesis text (H). In addition, the alignment score between the chunks of a T–H pair is also considered as a feature, and a Naive Bayes classifier is trained with the same set of parameters used for identification of the alignment type.

Fig. 1 Graphical representation of the interpretability layer of an iSTS method with aligned chunks, alignment type and alignment score

3 Background work about iSTS

In academics, adding explanation in the context of an Intelligent Tutoring System (ITS) aims to provide immediate feedback to the learners [7, 27]. ITSs are like a cognitive tutor that adopts an intellectual model to represent a problem like the human brain and provide feedback to the learners [8, 48]. For the same problem, an NLP-based solution is also available, which develops a Textual Entailment (TE) method to provide fine-grained analysis of a student's answer against a reference answer [40]. The proposed work develops a different methodology using classification and regression algorithms that provide feedback to the learners using an interpretability layer.

The proposed work is related to the area of NLU, in which two NLP tasks, STS and TE, are incorporated together to evaluate the quality of semantic representation. Semantic Textual Similarity has been in focus since the inception of the Semantic Evaluation (SemEval) workshops. In these workshops, a set of paired sentences was given, and the participating systems were asked to produce a graded similarity score ranging from 0 (completely unrelated) to 5 (semantically equivalent) [2–4, 6].

Interpretable STS (iSTS) was first introduced in the SemEval workshops, and participating teams were asked to develop an iSTS method similar to the proposed work. In 2015, iSTS was introduced as a pilot task within an STS task, and it was tested whether an STS method can explain the similarities and differences between a pair of sentences [6]. In the given task, two sentences were taken, and the following methodology was followed to develop an iSTS method:

– First, two sentences (s1, s2) were divided into smaller fragments called chunks by considering the CoNLL 2000 shared task chunking rules [46]. During chunking, the main clause was split into smaller chunks such as noun phrases (NPs), prepositional phrases (PPs), verb chains, and expressions (e.g., once upon a time), and subordinate clauses were considered as a single chunk (e.g., when I read).
– Next, chunks of s1 were aligned with a corresponding chunk of s2. During alignment, only one chunk of s1 could be aligned with a chunk of s2. If a chunk posed multiple alignments, then the corresponding chunk with the strongest relation was considered, and the others remained unaligned. During alignment, the interpretation and rationality of the whole sentence was considered.
– Further, each alignment was tagged with a relation type, and a set of eight (8) alignment types was used. The leftover chunks of the previous step were tagged with the "ALIC" (aligned to context) relation type. "NOALI" was assigned to all unaligned chunks, where no corresponding chunk was found.
– Last, a similarity or relatedness score ranging from 0 to 5 was assigned to each alignment.

For this pilot task, datasets of news headlines and image captions were considered. Participating systems were allowed to submit three (3) runs, and submitted systems were evaluated in terms of four (4) different performance measures (discussed in Section 6). However, in SemEval

2016, iSTS was introduced as a standalone task [5], and the following changes were incorporated with respect to the 2015 pilot task:

– Like the main clause, subordinate clauses were also divided into smaller chunks.
– During alignment, a chunk of s1 could be aligned with multiple chunks (1:M) of s2, which means that the (1:1) restriction was removed.
– Due to the 1:M alignment, the "ALIC" relation type was lifted, and two (2) new relation types, "FACT" (factuality) and "POL" (polarity), were added.
– In addition to the previous two datasets, the student answer dataset was also added.

The proposed method adapts the 2016 guidelines to develop an iSTS method that is an improvement relative to all the systems that participated in the 2016 task, as described in Section 6. Detailed information about the relation type and similarity score is provided in Sections 4.3 and 4.2 respectively.

4 Methodology

An iSTS method is composed of five modules, which are shown in Fig. 2. A pair of sentences (s1, s2) is given as input, and the first module, input handling and chunking (described in Section 4.1), pre-processes s1 and s2 and further divides them into chunks. Thereafter, the pre-processed sentence pair with chunks is taken as input to the alignment module, which is composed of two aligners, a one-one (1:1) and a one-multi (1:M) aligner (described in Section 4.2).

At the end of alignment, all the aligned pairs are passed independently to the measuring similarity and relatedness score (in short, scoring) module (described in Section 4.3) to measure the alignment score. Thereafter, alignments with a scaled predicted score of the previous module are passed to the classification module for assigning an alignment type (described in Section 4.4). Finally, pairs of sentences with aligned chunks, alignment types and scores are used to explain the similarities and differences between s1 and s2, which is labelled as the interpretability layer of an iSTS method.

4.1 Chunking

According to Abney, parsing by chunks has distinct processing advantages, which helps to explain the reason for adopting a chunk-by-chunk strategy by the human parser [1]. In this work, a rule-based chunker is developed and compared to a classifier-based chunker.

4.1.1 Rule-based chunker

The rule-based chunker begins with the word segmentation of each sentence using a tokeniser. Next, each token is tagged with part-of-speech tags, which prove to be very helpful in the next step. In the third step, chunk rules are used to extract the bigger chunks from the sentence as a chunk string. Thereafter, chunk strings are divided into smaller chunks by applying defined chunking rules. The chunking rules are extracted from gold standard (gs) chunk files, which are identified and prepared by human annotators and are available for experimental purposes.3 An example of gs chunking of a pair of sentences is given in Table 1.

To extract the chunking rules, the gs files are further processed and converted to IOB tag format. During conversion, '[' and ']' represent the beginning and ending of a chunk. Further, these IOB-tagged files are considered, and chunk patterns are extracted to develop a chunker. An example of such a format is shown in Fig. 3.

The top level represents tokens separated by a space. The second level shows the respective part-of-speech (pos) tags for each token, and the third level shows the chunking rules. The fourth level shows the chunks composed of one or more tokens. The fifth level identifies the chunk types as NP, VP and PP. In the last level, we have converted the chunks to IOB tag format, in which 'B' means the beginning of a chunk, 'I' means an inner token of the previous chunk and 'O' means the token is not part of any chunk.

Further, to find the chunk structure of a particular sentence, the chunker starts with a flat structure, i.e. with chunking-free tokens. The chunking rules are applied to update the chunk structure successively. The resulting chunk structure is returned when all the rules have been invoked. Sample output of the rule-based chunker is listed in Table 2. Examples are taken from the three different datasets, and each segment of Table 2 represents an input with its chunking rules. The third column shows the identified chunks and the last shows their type.

3 http://alt.qcri.org/semeval2016/task2/index.php?id=data-and-tools

Fig. 2 Various components of proposed interpretable semantic textual similarity (iSTS) method

Table 1 A pair of gold standard chunks with sentences taken from news headline dataset

Pair of Sentences Gold Standard Chunks

George W Bush weighs into immigration debate [ George W Bush ] [ weighs into ] [ immigration debate ]
George W. Bush warns against bitter immigration debate [ George W. Bush ] [ warns against ] [ bitter immigration debate ]

Fig. 3 Step-wise output of various preprocessing stages with rule-based chunking and IOB tags based on chunking rules. Example sentence: "A DOG NAPPING UNDER A SMALL TABLE"; STEP 1: tokenization; STEP 2: pos tagging (DT NN VBG IN DT JJ NN); STEP 3: chunking rules; STEP 4: chunks; STEP 5: chunk types (NP, VP, PP); STEP 6: IOB tags (B-NP I-NP B-VP B-PP I-PP I-PP I-PP)

Table 2 Output of rule-based chunker with required chunking rules and chunking types

Dataset Input with Chunking Rules Chunk Chunk Type

News headlines 2 French Journalists Killed in Mali


{<CD><JJ><NNPS>} [ 2 French Journalists] NP
{<VBN>} [ Killed ] VP
{<IN><NNP>} [ in Mali ] PP

Image captions A man sitting in a cluttered room


{<DT><NN>} [ A man ] NP
{<VBG>} [ sitting ] VP
{<IN><DT><JJ><NN>} [ in a cluttered room ] PP

Student Answer Bulb C was in an open path


{<NNP><NNP>} [ Bulb C ] NP
{<VBD>} [ was ] VP
{<IN><DT><JJ><NN>} [ in an open path ] PP
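For illustration, the following is a minimal sketch of a rule-based chunker of this kind, assuming NLTK's RegexpParser; the grammar only mirrors pos-tag patterns of the kind listed in Table 2, not the full rule set extracted from the gold standard files, and the example input is supplied already pos-tagged.

```python
# Minimal sketch of a rule-based chunker, assuming NLTK's RegexpParser.
# The grammar below mirrors patterns of the kind shown in Table 2; it is not
# the exact rule set extracted from the gold standard chunk files.
import nltk

grammar = r"""
  PP: {<IN><DT>?<JJ>*<NN.*>+}    # e.g. "in a cluttered room", "in Mali"
  NP: {<CD>?<DT>?<JJ>*<NN.*>+}   # e.g. "A man", "2 French Journalists"
  VP: {<VB.*>+}                  # e.g. "sitting", "Killed"
"""
chunker = nltk.RegexpParser(grammar)

# pos-tagged input, as produced by the tokenisation and tagging steps of Fig. 3
tagged = [("A", "DT"), ("man", "NN"), ("sitting", "VBG"),
          ("in", "IN"), ("a", "DT"), ("cluttered", "JJ"), ("room", "NN")]

tree = chunker.parse(tagged)             # nested chunk structure
print(tree)                              # (S (NP A/DT man/NN) (VP sitting/VBG) (PP in/IN a/DT cluttered/JJ room/NN))
print(nltk.chunk.tree2conlltags(tree))   # IOB view, comparable to the last level of Fig. 3
```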

4.1.2 Chunking with maximum entropy (MaxEnt) classifier

The rule-based chunker decides what chunks should be created entirely on the basis of pos tags. However, sometimes pos tags are insufficient for determining the chunks of a sentence. To understand this, the following examples have been considered:

– Four/DT people/NN sitting/VBG at/IN a/DT table/NN ./.
– Two/CD smiling/VBG women/NN holding/VBG a/DT baby/NN ./.

The above examples have one pos tag in common, 'VBG'. For the first sentence, sitting is marked as a separate 'VP' chunk based on its pos information and its contextual position. But in the second example, the 'VBG' tag is attached to two different tokens, smiling and holding. For the second example, if only pos tags are considered, then smiling must be part of a separate chunk 'VP'.

To maximise the performance of a chunker, we should also use information about the words' content. For this purpose, a classifier is adopted to incorporate the contextual meaning of a word. It returns the chunks as IOB tags. Figure 4 shows the workflow of the proposed classifier-based chunker.

Fig. 4 Training and testing of the proposed classifier-based chunker

At first, the sentence is tokenised, and the following sets of features are extracted from it:

– Pos tags with words, by which the classifier can learn from both the word and the pos tag (the rule-based chunker only considers pos tags).
– Interactive information between adjacent tags, which is passed to the classifier by providing the previous pos tag information.

Further, to get the contextual features for each token, the following features are considered:

– Lookahead features for each token.
– A string of pos tags paired with the most recent tags, to get the contextual evidence for each token.
– Tags such as determiner (DT), coordinating conjunction (CC), and personal (PRP) and possessive (PRP$) pronouns are considered for this purpose. The Penn Treebank pos tags4 are adopted here.

4 https://www.sketchengine.eu/penn-treebank-tagset/

The final step is classification, which consists of two modules. The first module is a sequence classification technique, known as sequential classification, which determines the most likely category tag for the first input and then uses it to find the finest label for the next input. This process is repeated until all input is labelled. The second module converts the output labels to chunk labels, such as NP, VP, PP, ADJP or ADVP. For classifier-based chunking, we have decided to use the Maximum Entropy (MaxEnt) text classifier for the following reasons:

– In the literature, the MaxEnt classifier has already been used for text chunking and achieved state-of-the-art performance [46].
– It works with the intuition that if no sufficient evidence is available to favour one chunk over another, both can be considered [26, 30].

The MaxEnt model has the following conditional form:

P(w | h) = (1 / Z(h)) · exp( Σi λi fi(h, w) )    (1)

P(w | h) in (1) represents the probability distribution P of a class variable w for a given set of features h, the parameter λi represents the importance of each feature fi(h, w), and Z(h) is a normalisation factor.
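To make the feature design concrete, the sketch below shows one way the per-token features described above could be assembled, in the style of NLTK's consecutive classifier-based chunkers; the function and feature names are ours, not the authors', and the exact feature set used in this work may differ.

```python
# Hedged sketch of per-token feature extraction for the classifier-based
# chunker. The paper lists word+pos identity, the previous tag, lookahead
# information and selected closed-class tags as features.
def chunk_features(tokens_with_pos, i, history):
    """tokens_with_pos: [(word, pos), ...]; history: IOB tags predicted so far."""
    word, pos = tokens_with_pos[i]
    prev_pos = tokens_with_pos[i - 1][1] if i > 0 else "<START>"
    prev_tag = history[-1] if history else "<START>"
    next_pos = tokens_with_pos[i + 1][1] if i + 1 < len(tokens_with_pos) else "<END>"
    return {
        "word": word.lower(),            # word identity (ignored by the rule-based chunker)
        "pos": pos,                      # current pos tag
        "prev_pos": prev_pos,            # adjacent-tag information
        "prev_pos+pos": prev_pos + "+" + pos,
        "prev_tag": prev_tag,            # most recently predicted IOB tag
        "next_pos": next_pos,            # lookahead feature
        "is_closed_class": pos in {"DT", "CC", "PRP", "PRP$"},
    }

# Training would pair these feature dicts with gold IOB tags and fit a MaxEnt
# model, e.g. nltk.MaxentClassifier.train(train_set) as one possible backend.
```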
4.2 Alignment

To align the chunks in a pair of sentences, the alignment module follows a three-step process. First, sentence-level tokens are aligned using a monolingual word aligner [49]. In the second step, based on the token alignment information of the previous step, a chunk of one sentence is aligned with a chunk of the other. During alignment, the following assumption is considered: "a pair of chunks will be aligned if a token from each chunk is already aligned in the first step". Finally, all the unaligned chunks in a pair of sentences are taken for further alignment. For this purpose, a 1:M (multi) chunk aligner is considered, and in this step, multiple chunks of one sentence are aligned with a chunk of the other sentence.

4.2.1 One-one (1:1) token alignment

A ready-to-use monolingual word aligner is adopted here [49] and is available for download.5 We have adopted the monolingual word aligner for 1:1 token alignment for the following reasons:

– It has already been tested over two aligner datasets, the (i) MSR alignment dataset [11] and (ii) Edinburgh++ corpus [51].

– The monolingual word aligner has already reported improved alignment results relative to other state-of-the-art word aligners [12, 51, 55].

This monolingual word aligner works with the hypothesis that "if words are represented in a similar context, then the pair of words are possible candidates for alignment". The 1:1 token aligner categorises a pair of words into any of the following four groups:

– aligning of identical words
– aligning of named entities
– aligning of content words
– aligning of stop words

Further, each pair of words is aligned independently, and contextual evidence between them helps to make the alignment decision.

Two sources, (i) syntactic dependencies and (ii) words occurring within a small textual vicinity, are considered to get the contextual evidence. In addition to this, the semantic similarity score between a pair of words is also considered for alignment. Three levels of word similarity are measured for this 1:1 token aligner:

– A similarity score of one (1) is used for alignment if a pair of words (with lemmatisation) are exactly the same.
– A paraphrase score in the range (0, 1) is used as the similarity score to align non-identical word pairs. The largest (XXXL) Paraphrase Database (PPDB)6 is used to extract all possible pairs of non-identical word pairs.
– Finally, for any such pair that is not present in PPDB, a similarity score of 0 is assigned.

Further, to align more chunks, the following alignment guidelines are considered:

– Chunks are aligned if their meanings are the same or related, considering the context and the interpretation of the corresponding sentence.

[Red double decker bus] [driving] [through the streets]
[Double decker passenger bus] [driving] [with traffic]

– Chunks are aligned if they play similar roles in an underlying event. There are two possible scenarios:

i) chunks that are different but related in roles can be aligned.

[Hundreds] [of Bangladesh clothes factory workers ill]
[Hundreds] [fall] [sick] [in Bangladesh factory]

ii) when the sentences refer to two different events and the chunks are in different roles, the chunks also need to be aligned.

[Saudis] [to permit] [women] [to compete] [in Olympics]
[Women] [are confronting] [a glass ceiling]

4.2.2 One-multi (1:M) chunk alignment

A token-to-chunk multi-aligner method, which is proposed and reported in [37], is adopted here. The multi-chunk aligner works with a fundamental goal: leaving the minimum number of chunks unaligned. The following points are considered to move from a 1:1 token alignment to 1:M chunk alignments:

– If cs and ct are two chunks of a sentence pair s and t, then both chunks are aligned to each other if a token of the chunk of s is already aligned with a token of the chunk of t.
– The average Wu and Palmer similarity score between the tokens of the aligned chunks is also considered for 1:M alignment [54].
– Further, to align the unaligned chunks of s and t, the cosine similarity score between a pair of chunks with a threshold value (0.3 ≤ th ≤ 0.7) is considered for this purpose.
– Finally, all the 1:1 alignments based only on stop words are discarded if the chunk length is more than 1.

The chunks of each dataset have been aligned independently, and for the 1:M chunk aligner, the cosine similarity between a pair of chunks is one of the features. To get the correct cosine similarity between a pair of headlines dataset chunks, Google pre-trained news vectors with dimension 300 are used here.7

5 https://github.com/FerreroJeremy/monolingual-word-aligner
6 http://paraphrase.org/#/download
7 https://code.google.com/archive/p/word2vec/
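The chunk-level cosine check used by the 1:M aligner can be sketched as follows; this is our illustration, not the authors' code: it assumes pre-trained vectors are loaded with gensim (e.g. the Google News 300-d vectors mentioned above), represents a chunk by the average of its token vectors, and reads the stated threshold as a band on the similarity value.

```python
# Hedged sketch of the 1:M aligner's cosine-similarity test between two chunks.
# Assumes a gensim KeyedVectors model (e.g. Google News, 300-d) is available.
import numpy as np
from gensim.models import KeyedVectors

# kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def chunk_vector(kv, chunk_tokens):
    """Average the vectors of the in-vocabulary tokens of a chunk."""
    vecs = [kv[w] for w in chunk_tokens if w in kv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def within_threshold(kv, chunk_s, chunk_t, low=0.3, high=0.7):
    """Candidate 1:M alignment if the chunk cosine similarity falls in the stated band."""
    u, v = chunk_vector(kv, chunk_s), chunk_vector(kv, chunk_t)
    if u is None or v is None:
        return False
    return low <= cosine(u, v) <= high
```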
4.2.3 Aligning of the image caption dataset

A word2vec model is trained, and image captions from the Microsoft COCO image dataset and the SemEval 2015 dataset are used here. The Microsoft COCO image dataset8 comprises over 6 lakh image captions, combining the train (approx. 412529 captions) and validation (approx. 201942 captions) sets. The following parameter values are considered while training the word2vec model:

8 http://cocodataset.org/#download

size – here, the dimension of each word vector is taken as 350. It means that a maximum of 350 word

embeddings will be considered when calculating the cosine similarity between a pair of words.
window – here, the maximum distance between a target word and the words around the target is taken as 3.
min count – the minimum count of words is taken as three (3). In the corpus, any word that appears fewer than three times will be ignored.
workers – the number of threads during training is three (3).
sg – the Continuous Bag of Words (CBOW) model is used here to train the word embeddings.
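A sketch of how such a model could be trained with gensim under the parameter values listed above is given below; the keyword names are gensim's (version 4 and later), the mapping from the list above is ours, and the tiny stand-in corpus only makes the snippet self-contained.

```python
# Hedged sketch: training word2vec on caption text with the parameters above
# (vector_size=350, window=3, min_count=3, workers=3, CBOW).
from gensim.models import Word2Vec

# `captions` would be an iterable of tokenised captions, e.g. from MS COCO and
# SemEval 2015; a tiny stand-in is used here so the snippet runs on its own.
captions = [
    ["a", "cat", "standing", "on", "tree", "branches"],
    ["a", "black", "and", "white", "cat", "is", "high", "up", "on", "tree", "branches"],
] * 100

model = Word2Vec(
    sentences=captions,
    vector_size=350,   # "size": dimension of each word vector
    window=3,          # max distance between target and context words
    min_count=3,       # ignore words appearing fewer than 3 times
    workers=3,         # training threads
    sg=0,              # 0 = CBOW, as stated above
)
print(model.wv.similarity("cat", "branches"))  # cosine similarity between two tokens
```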
4.2.4 Aligning of the student answer dataset

Before aligning the student answer dataset, each sentence pair is run through a pre-processing module. The following list of changes is made to represent the syntactic meaning of a pair of sentences at the same level.

– If a line has A, B and C, meaning Bulb A, B and C, then the word Bulbs is added to the sentence.
  sent1 – bulbs a, b, and c are on a path with the battery
  sent2 – A and C are in the same path
  output – Bulbs A and C are in the same path
– X, Y, and Z in a student answer mean switch X, Y, and Z.
  sent1 – switch X does not effect bulbs B and C
  sent2 – There is a path with B and C that does not include X
  output – There is a path with B and C that does not include switch X
– If numbers appear alone in a student answer, then circuit is added before the number.
  sent1 – The battery is not in a closed path
  sent2 – The battery in 2 is not in a closed path
  output – The battery in circuit 2 is not in a closed path
– In a student answer, paths can be 'closed' or 'open'. But if answers have only paths, then it is taken to be a closed path.
  sent1 – Switch Y is not on the path of bulb A.
  sent2 – There is no closed path containing both switch Y and bulb A
  output – Switch Y is not on the closed path of bulb A

4.3 Measuring of the similarity and relatedness score

The scoring module works independently of the classification module, and a score from 0 to 5 is assigned to an alignment. Further, the following distribution is considered before assigning a similarity score, and examples of various alignment scores for aligned pairs are provided in Table 3.

– 5 is assigned if the meanings of both chunks are equivalent.
– [4 or 3] if the chunks are very similar or closely related.
– [2 or 1] if the chunks are aligned by sharing minimum information.
– 0 is assigned if the chunks are completely unrelated to each other, and both chunks are considered unaligned.

In this work, a supervised linear regression algorithm is developed to assign an alignment score. A linear regression algorithm is a linear approach to finding the relationship between a scalar response (or dependent variable) y and one or more explanatory variables (or independent variables) x. For this work, we have adopted linear regression because the alignment score y is directly proportional to the alignment type. A regression model can be simple (as shown in (2)) or multiple (as shown in (3)) depending on the number of independent variables (x's) that are considered to predict the similarity and relatedness score.

y = β0 + β1 x    (2)

y = β0 + β1 x1 + β2 x2 + · · · + βn xn    (3)

where y is the prediction and β0, βi (i = 1, . . . , n) in (2) and (3) are the intercept and coefficients of x. n is the number of features, and one or more features are taken to train a simple or multiple linear regression model.
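For concreteness, a regression of the form in (2) and (3) can be fitted with scikit-learn as sketched below; the feature matrix here is a stand-in for the chunk-pair features described next, and the clipping of predictions to the 0-5 scale is our addition.

```python
# Hedged sketch of fitting the alignment-score regression of (2)/(3) with
# scikit-learn. X holds per-alignment feature values in [0, 1] (stand-in data);
# y holds annotated alignment scores in [0, 5].
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[0.9, 0.8, 1.0], [0.2, 0.1, 0.3], [0.6, 0.5, 0.7]])
y_train = np.array([5.0, 1.0, 3.0])

reg = LinearRegression().fit(X_train, y_train)   # learns beta_0 (intercept) and beta_i
print(reg.intercept_, reg.coef_)                 # beta_0, [beta_1, ..., beta_n]

X_test = np.array([[0.7, 0.6, 0.8]])
y_pred = reg.predict(X_test)                     # predicted similarity/relatedness score
print(np.clip(y_pred, 0, 5))                     # keep the prediction on the 0-5 scale
```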
To train the regression model, three types of features are extracted from a pair of aligned chunks. The feature values range from 0 to 1 and can be categorised into three groups. For the first category, string similarity measures such as the Jaccard similarity coefficient, cosine similarity and common word ratio are considered. These methods produce an alignment score by considering the count of shared and distinct words across the chunks.

The second group consists of distributional representations of words as word vectors and measures the similarity by considering the context of a word. Finally, the third group of methods measures the similarity between two tokens by considering lexical features, and WordNet9 is used for this purpose. These sets of features are as follows:

9 https://wordnet.princeton.edu/

4.3.1 Jaccard similarity coefficient

To measure the similarity, first, the chunks are tokenised. Further, the tokens of each chunk (C1, C2) are compared to check the percentage of shared and distinct tokens across the chunks. It has been observed that the higher the percentage,

the more similar the chunks are.

J(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|    (4)

The Jaccard similarity score (J) is the fraction of shared and distinct tokens in C1 and C2, as shown in (4).

4.3.2 Cosine similarity

It is a measure of similarity in terms of the cosine of the angle between two non-zero vectors in an inner product space. First, to calculate the cosine similarity between two chunks (C1, C2), we need to convert the chunks into feature vectors with a dimension equal to the number of tokens in C1 and C2. For this, the count of word frequencies in each chunk is taken as the feature vector. Equation (5) represents the dot product between two vectors, where A and B represent the word-wise counts of chunks C1 and C2.

A · B = ||A|| ||B|| cos θ    (5)

Given two vectors of word frequencies, A and B, the cosine similarity is measured using the dot product and magnitudes as follows:

cos(θ) = (A · B) / (||A|| ||B||) = Σi Ai Bi / ( sqrt(Σi Ai^2) · sqrt(Σi Bi^2) )    (6)

where Ai and Bi represent the components of A and B. The resulting similarity score ranges from −1 (chunks have completely opposite meanings) to 1 (fully similar or equivalent), while in-between values indicate intermediate similarity or dissimilarity.

4.3.3 Common word count

The number of common words between two chunks is also taken into account and considered as a feature value, which is shown in (7).

Simi_score = |words_c1 ∩ words_c2| / ( 0.5 ∗ (len_c1 + len_c2) )    (7)

The numerator of (7) counts the number of common words present in a pair of chunks, and len in the denominator is the number of tokens in c1 and c2, which represent the respective chunks of the two sentences. This feature value ranges from 0 to 1, where 0 means no common word is found and 1 means the chunks are fully similar.
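The three string-similarity features of (4)-(7) can be computed as in the sketch below (token-set Jaccard, word-frequency cosine and common-word ratio); the helper names are ours and the tokenisation is simplified to whitespace splitting.

```python
# Hedged sketch of the string-similarity features of Sections 4.3.1-4.3.3.
import math
from collections import Counter

def jaccard(c1_tokens, c2_tokens):                       # Eq. (4)
    s1, s2 = set(c1_tokens), set(c2_tokens)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def cosine_sim(c1_tokens, c2_tokens):                    # Eqs. (5)-(6), word-frequency vectors
    a, b = Counter(c1_tokens), Counter(c2_tokens)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def common_word_ratio(c1_tokens, c2_tokens):             # Eq. (7)
    common = len(set(c1_tokens) & set(c2_tokens))
    return common / (0.5 * (len(c1_tokens) + len(c2_tokens)))

c1, c2 = "in a lush green field".split(), "in a field".split()
print(jaccard(c1, c2), cosine_sim(c1, c2), common_word_ratio(c1, c2))
```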
4.3.4 Word embeddings

Three different libraries of word embeddings are used to measure the similarity score between two chunks. Word embeddings are capable of capturing the context of a word in a document and also provide semantic and syntactic relations with other words. Word embeddings can be obtained using either the skip-gram [20] or the continuous bag of words (CBOW) [22] mechanism. Further, the Euclidean distance (or cosine similarity) is used to measure the linguistic or semantic similarity of the corresponding words.

– Google's pre-trained Word2Vec includes word vectors for a 3-million-word vocabulary and phrases, trained on a Google News dataset of approximately 100 billion words [39].
– Global Vectors (GloVe) is an unsupervised learning algorithm and is another way to obtain vector representations for words. Training is performed on an aggregated global word-word co-occurrence matrix with their frequencies in a corpus [42]. The GloVe wiki gigaword-100 model was trained over 6 billion words from the Wikipedia 2014 news corpus with a vector length of 100 features.
– spaCy10 word vectors comprise 300-dimensional GloVe vectors for over 1 million terms in English. Based on the word embeddings, spaCy offers a similarity between token (word), span (a set of consecutive words as a chunk), document and lexical units.

10 https://spacy.io/models

4.3.5 Wu and Palmer (WuP) similarity

It measures the relatedness score by taking into account the depths of two synsets (syn1, syn2) in the WordNet taxonomies and the depth of the least common subsumer (LCS).

WuP_simi = 2 ∗ depth(lcs(syn1, syn2)) / ( depth(syn1) + depth(syn2) )    (8)

The similarity score (WuP_simi) lies in the range 0 < WuP_simi <= 1. The value can never be zero because the LCS depth is never zero, owing to the taxonomy root depth being one. This measures the similarity based on the similarity of the word senses in the hypernym tree, where the synsets are connected to each other [54].

4.3.6 Path similarity

By counting the number of edges (l) along the shortest path between the senses (s1, s2) of a word, the

path similarity considers the 'is-a' hierarchies of words to calculate the semantic relatedness score between two words [18].

path(s1, s2) = 1 / l(s1, s2)    (9)

To measure the WuP and path relatedness scores, we have written three functions, and the required pseudo code is listed as Algorithms 1-3. First, aligned chunks are processed for lemmatisation using LemmatizedChunk (listed as Algorithm 1). In Algorithm 1, a chunk (C) is given as input, and it returns the lemmatised chunk lchunk as output.

Thereafter, the inflected chunks (lc1, lc2) are tokenized and assigned to tlc1 and tlc2. Further, stop words are removed from tlc1 and tlc2. After the removal of stop words, the pos tags of tlc1 and tlc2 are identified and saved as pos1 and pos2. These steps are part of the RelatednessScore algorithm (listed as Algorithm 3), which is used to measure the WuP and path relatedness scores between the chunks. Next, we retrieve the token-wise synsets using the GetSynset algorithm (listed as Algorithm 2). Further, these lists of synsets are used to calculate the relatedness score.

In Algorithm 3, pathScore and WuPScore are used to save the relatedness scores between two synsets of tokens of C1 and C2 respectively. maxPathScore and maxWuPScore keep the maximum score after each iteration.
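Since the listings of Algorithms 1-3 are not reproduced here, the sketch below illustrates the same idea with NLTK's WordNet interface: lemmatise and tokenise the chunks, look up synsets per token, and keep the maximum WuP and path scores over synset pairs. The function names follow the description above, but the code itself is our reconstruction (stop-word removal and pos filtering are omitted), and the WordNet corpus must be downloaded via nltk.download('wordnet').

```python
# Hedged reconstruction of the WuP/path relatedness computation described by
# Algorithms 1-3, using NLTK's WordNet interface.
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatized_chunk(chunk):                       # Algorithm 1 (LemmatizedChunk)
    return [lemmatizer.lemmatize(tok.lower()) for tok in chunk.split()]

def get_synsets(tokens):                           # Algorithm 2 (GetSynset)
    return [wn.synsets(tok) for tok in tokens]

def relatedness_score(chunk1, chunk2):             # Algorithm 3 (RelatednessScore)
    max_path, max_wup = 0.0, 0.0
    for syns1 in get_synsets(lemmatized_chunk(chunk1)):
        for syns2 in get_synsets(lemmatized_chunk(chunk2)):
            for s1 in syns1:
                for s2 in syns2:
                    path = s1.path_similarity(s2) or 0.0   # Eq. (9)
                    wup = s1.wup_similarity(s2) or 0.0     # Eq. (8)
                    max_path, max_wup = max(max_path, path), max(max_wup, wup)
    return max_wup, max_path

print(relatedness_score("Two mountain goats", "Two animals"))
```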

Table 3 Example of aligned chunks with alignment scores. These alignment scores are taken from annotated data

Chunk 1  Chunk 2  Score
on tree branches  on tree branches  5
An Apple computer  A Macintosh computer  5
A cat  A black and white cat  4
A woman  A young girl  4
standing  grazing  3
in a field  in a lush green field  3
in front of yellow flowers  in a field  2
with another smiling woman  laughing  2
Two mountain goats  Two animals  1

4.4 Identification of the alignment types

A multiclass supervised classification algorithm is developed and used to assign an alignment type. The following seven classification labels are used for this purpose:

– EQUI: the chunks are semantically equivalent and have the same meaning.
– OPPO: the meanings of the aligned chunks are opposite to each other.
– SPE1, SPE2: the chunks are similar, but the chunk in sentence1 is more specific than that of sentence2, and vice versa.

In addition to these four exclusive labels, two other labels, SIMI and REL, are used when chunks are close or similar in meaning. These two labels are used in the following manners:

SIMI – when the meanings of chunks are close or similar, they share the same attributes and there is no EQUI, OPPO, SPE1 or SPE2 relation.

REL – when chunks are not similar but are closely related by some relation which is not mentioned above.
NOALI – all the unaligned chunks of the sentence pair are marked as NOALI, and a chunk remains unaligned when no corresponding chunk is found.

Interpretation of the entire sentence, such as commonsense inference, is taken into account before assigning an alignment category. This means the background of the chunks being aligned is used to know whether the chunks being aligned refer to the same case. An instance can be a physical or abstract object instance for NP chunks and a real-world event for VP chunks. In Table 4, an example of each alignment type is listed.

Table 4 Sample pairs of aligned chunks with alignment types. These examples are taken from annotated data from the SemEval 2016 iSTS task

Chunk 1  Chunk 2  Alignment Type
From Soccer  from football  EQUI
waiting  sits  SIMI
Red double decker bus  Double decker passenger bus  SPE1
through the streets  with traffic  REL
a women  a young women  SPE2
on the ground  through the air  OPPO
report deaths  killed  EQUI FACT
Gunmen  Militants  EQUI POL

A set of ten features is used to develop this classification module. Out of these ten features, nine are taken from the scoring module discussed in Section 4.3. Along with these nine features, a scaled prediction value of the scoring module (discussed in Section 4.3, ranging from 1 to 5) is used. The following list of classifiers is used to train the supervised classification algorithm:

4.4.1 Gaussian Naive Bayes

A class (Ck) is predicted by considering the observation value (v) of the feature vector (xi). Further, the probability distribution of v for a given class Ck can be measured by (10).

p(xi = v | Ck) = ( 1 / sqrt(2π σk^2) ) · exp( −(v − μk)^2 / (2 σk^2) )    (10)

where μk and σk^2 represent the mean and Bessel-corrected variance of the values in xi for a given class Ck. Gaussian Naive Bayes works on the simple assumption that the feature values associated with each class obey a Gaussian distribution [21].

4.4.2 k-Nearest Neighbours (knn)

The Scikit-learn machine learning library is used to implement the knn classifier, and the parameters listed in Table 5 are considered for this purpose.

To decide a nearest neighbour, the auto algorithm chooses the appropriate one among the ball tree, kd tree or brute force searching algorithms. It chooses the algorithm by considering the values passed during training. The default leaf size is 30, and the Minkowski distance metric with the Euclidean distance is considered here [41].

Table 5 List of parameters used to train the knn classifier

Parameter Type  Parameter Values
number of neighbors  9
weights  uniform
search algorithm  auto
leaf size  30
distance measure  Euclidean
distance metric  Minkowski

Table 6 List of parameters used to train the Support Vector Classifier

Parameter Type  Parameter Values
kernel  polynomial
degree  4
regularization parameter  2
kernel coefficient (gamma)  scale
shrinking heuristic  true
tolerance for stopping criteria  1e-3
size of the kernel cache  200 MB
maximum iteration  no limit
decision function  one-vs-rest (ovr)

4.4.3 Support Vector Machine (SVM)

The LIBSVM11 library is used here to train the support vector classifier, and the parameters listed in Table 6 are considered for this purpose [13]. Further, a discussion of the performance of the proposed classification module is provided in Section 5.5.

11 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
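The three classifiers can be instantiated in scikit-learn roughly as follows; the hyperparameters mirror Tables 5 and 6, the SVC shown uses scikit-learn's LIBSVM-based implementation rather than the LIBSVM command-line tool, and the feature matrix and labels are stand-ins rather than the actual dataset.

```python
# Hedged sketch of the classification module's three classifiers with the
# hyperparameters of Tables 5 and 6. X (ten features per alignment) and y
# (alignment-type labels) are stand-ins.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X = np.random.rand(60, 10)                       # nine scoring features + scaled score
y = np.random.choice(["EQUI", "SPE1", "SPE2", "SIMI", "REL", "OPPO"], size=60)

gnb = GaussianNB()                               # Section 4.4.1
knn = KNeighborsClassifier(                      # Table 5
    n_neighbors=9, weights="uniform", algorithm="auto",
    leaf_size=30, metric="minkowski", p=2)       # p=2 -> Euclidean distance
svc = SVC(                                       # Table 6 (LIBSVM-based backend)
    kernel="poly", degree=4, C=2, gamma="scale",
    shrinking=True, tol=1e-3, cache_size=200,
    max_iter=-1, decision_function_shape="ovr")

for clf in (gnb, knn, svc):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[:3]))
```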
4.5 Formation of the interpretability layer

This module assimilates the output of the previous modules and forms an interpretability layer, which is depicted in Fig. 5. The first segment shows a pair of input sentences as source and translation. The next segment shows the tokenised form of each sentence, numbered in increasing order. At the last level, chunking, alignment and alignment type with alignment score are combined to form the interpretable layer.

In the alignment segment, numbers represent the token numbers. The left and right numbers of an alignment (⇔) represent the token numbers of the source and translation sentences, respectively. These token numbers jointly or individually represent a chunk. During alignment, the third token of the source text, 'detects', remains unaligned, and the 'NOALI' alignment type is assigned to this chunk.

The last segment also represents each alignment with its type (marked in bold face) and alignment score. Out of the six chunks of the source sentence, two chunks ('says' and 'Russia') are aligned as the 'EQUI' (completely equivalent) type. Another chunk of the source text, 'from Mediterranean Sea', is aligned to 'in Mediterranean' of the translation text as 'SPE1', as the chunk of the source text is more specific than that of the translation.

5 Experiment analysis and results

5.1 Details about the datasets

The dataset comprises pairs of sentences from news headlines, image captions and interactions between a student and the BEETLE II tutorial dialogue system (student answer). A benefit of using this dataset is that the data have already been processed with a large number of computational natural language processing methods, so we can easily compare the results with other reported results. The headlines corpus is a collection of naturally occurring news headlines from various news sources, gathered by the Europe Media Monitor engine (from April 2nd, 2013 to July 28th, 2014) [14]. The image caption dataset is a subset of the PASCAL VOC-2008 dataset [43], which consists of 1000 images with ten descriptions each.

The student answer dataset consists of the interactions between students and the BEETLE II tutorial dialogue system. The BEETLE II system is an intelligent tutoring engine, which teaches students about basic electricity and electronics [17]. A text-based chat interface is used to collect feedback from students, and 73 undergraduate students recorded their feedback. Further, students' answers were annotated in the form of the BEETLE human-computer dialogue corpus. The student answer dataset is composed of a student's answers along with an expert answer. These datasets are freely available12 for experimental purposes.

12 http://alt.qcri.org/semeval2016/task2/index.php?id=data-and-tools

Table 7 reports statistics about these datasets. The student answer dataset contains a slightly higher number of chunks and tokens for every sentence and chunk, respectively. A higher percentage of chunks are aligned as EQUI and have a score of 5 (across all three datasets). Less than 1% of aligned chunks have a relatedness score of 1 (insufficient data for training and testing). The first three rows of Table 7 report the number of sentence pairs, tokens for every chunk and chunks for every sentence.

The next row reports the number of aligned pairs across the datasets in the training and testing files separately. The next segment of Table 7 shows the percentage of each alignment score (ranging from 1 to 5). The last two segments show the statistics of each alignment type along with their percentages. The last segment shows the number of

Fig. 5 Textual representation of the interpretable layer using tokenisation, chunking, and alignment with type and score of parallel sentences

Table 7 Details of news headlines (Headlines), image captions (Images) and student answer (Answer) datasets with the number of aligned chunks
and aligned types with score

Headlines Images Answer

Train Test All % Train Test All % Train Test All %

Sentence Pair 756 375 1131 – 750 375 1125 – 330 344 674 –
Chunks/sentence 4.17 4.30 4.26 – 4.53 4.86 4.70 – 4.31 4.15 4.23 –
Token/chunk 1.86 1.86 1.86 – 2.25 2.17 2.21 – 2.47 2.45 2.46 –

Aligned Pairs 2187 1171 3358 1913 1148 3061 925 939 1864

score ∈ 5 1324 685 2009 59.82% 1030 628 1658 54.15% 593 563 1156 62.02%
score ∈ 4 421 272 693 20.63% 516 313 829 27.08% 201 217 418 22.42%
score ∈ 3 264 149 413 12.29% 212 125 337 11.00% 98 109 207 11.11%
score ∈ 2 152 60 212 6.31% 128 79 207 6.76% 28 48 76 4.08%
score ∈ 1 26 5 31 0.90% 27 3 30 0.98% 5 2 7 0.38%

EQUI 1315 684 1999 59.52% 1029 624 1653 54.00% 595 564 1159 62.18%
SPE1 198 107 305 9.08% 245 150 395 12.90% 61 67 128 6.87%
SPE2 190 108 298 8.87% 229 152 381 12.45% 74 78 152 8.15%
SIMI 318 158 476 14.16% 340 185 525 17.15% 73 77 150 8.05%
REL 127 99 226 6.73% 67 32 99 3.23% 76 97 173 9.28%
OPPO 19 13 32 0.95% 3 1 4 0.13% 39 49 88 4.72%

NOALI 1779 869 2648 – 2892 1318 4210 – 930 908 1838 –
FACT 16 0 16 0.48% 0 0 0 0 0 0 0 0
POL 4 2 6 0.18% 0 0 0 0 0 0 0 0

unaligned (NOALI) chunks and also reports how many times the FACT and POL labels are used as an additional label with other labels. Only the headlines dataset has FACT and POL labels, and out of 3358 aligned pairs, only 16 and 6 pairs are labelled with the FACT and POL types. Due to the limited data, the POL and FACT labels are not considered for training and testing.

5.2 Evaluation of the chunking module

The accuracy of the proposed chunking module, i.e., rule-based and classifier-based, has been evaluated against the gold chunks provided with the dataset. The accuracy has been measured on the basis of four evaluation metrics: IOB accuracy, precision (P), recall (R) and F1-measure (F1). In addition, the sentence-level (SL) accuracy is also measured by comparing the system chunks to the gold chunks.

The chunking accuracy based on IOB tags of the classifier-based chunker (reported in Table 8) is higher than that of the rule-based chunker. The IOB accuracy is 90.40%, 98.50% and 95.56% for the headlines, images and answer datasets, respectively. When the comparison is performed at the sentence level, the classifier-based system chunks also outperform the rule-based chunks against the gold standard in all three datasets, and accuracies of 82.56%, 89.95% and 89.75% are recorded respectively.

5.2.1 Error analysis

This analysis has been performed based on the incorrectly identified chunks marked by the rule-based chunker. All the chunks returned by the rule-based chunker have been compared to the gold standard chunks as well as the output of the classifier-based chunker. A list of three chunking rules has been identified for which the rule-based chunker fails to detect the correct chunk. For each type of rule, an example is listed with its gold standard, rule-based and classifier-based chunks. The output of the classifier chunker is listed as a sequence of three-tuples: word, pos tag and IOB tag.

Chunking rule contains a personal pronoun (PRP$) When a chunking rule encounters a personal pronoun ({<IN><PRP$><NNS>}), the rule-based chunker marks the personal pronoun as a separate chunk. In the example given in Table 9, the incorrectly identified chunk (for the rule-based chunker) is marked in bold face, and '[' and ']' represent the beginning and ending of a chunk.

Chunking rule has a noun and gerund together When the NP ({<NN><VBG><NN>}) and VP ({<VBG>}) chunking rules are applied successively over a sentence, a token having a 'VBG' pos tag is chunked with an NP chunk. An example of such incorrectly marked chunks is listed in Table 10.

Chunking rules with a coordinating conjunction (CC) If a chunking rule is preceded by a coordinating conjunction (CC) {<CC><DT><NN>}, then a chunk is divided into two chunks, such as in the example listed in Table 11, where "with a black top" and "and a necklace" are marked as two independent chunks.

5.3 Evaluation of the alignment module

For aligning, we have considered both system (sys) and gold standard (gs) chunks. The F1 alignment scores using gs and sys chunks are listed in Tables 12 and 13, respectively. Out of the three datasets, alignments on the headlines test data using gs chunks attained the highest F1 accuracy of 0.9523.

In these two tables (see Tables 12 and 13), the first row reports the accuracy of 1:1 chunk alignment (refer to Section 4.2), and in the second and consecutive rows, we have reported the results of 1:M chunk alignment. The last row represents the alignment score of the combined output of the 1:1 aligner with all the features used for the 1:M aligner. For the answer dataset, we have reported the 1:M aligner using the stop words feature only, because the alignment results did not improve by adding the 1:M aligner features to the 1:1 alignment features. Accuracies of 0.8842 and 0.8233 are achieved using the gs and sys chunks of the test datasets.

Table 8 Results analysis of rule-based and classifier-based chunkers

Dataset Chunker Type IOB Accuracy P R F1 SL

headlines rule 90.30 73.10 89.60 77.01 75.76


classifier 90.40 85.40 81.50 83.40 82.56

images rule 93.90 87.00 88.80 87.89 71.45


classifier 98.50 91.60 91.20 91.40 89.85

answer rule 92.50 90.40 82.30 86.16 77.60


classifier 95.56 89.40 84.50 86.88 89.75

Table 9 Output of chunking module with chunks containing a personal pronoun pos tag

Chunk Type Sentence as a Chunk String

Gold Standard [The old lady] [is standing] [in the kitchen] [with two cats] [at her feet]

Rule-based [The old lady] [is standing] [in the kitchen] [with two cats] [at] [her] [feet]

Classifier-based [(The, DT, B-NP) (old, JJ, I-NP) (lady, NN, I-NP)]
[(is, VBZ, B-VP) (standing, VBG, I-VP)]
[(in, IN, B-PP) (the, DT, I-PP) (kitchen, NN, I-PP)]
[(with, IN, B-PP) (two, CD, I-PP) (cats, NNS, I-PP)]
[(at, IN, B-PP) (her, PRP$, I-PP) (feet, NNS, I-PP)]

5.4 Evaluation of the scoring module

Acceptance of the scoring module is measured based on various parameters. The Pearson correlation matrix is used to find the relation between the features (xi) and the alignment score (y). Further, four different models have been trained and evaluated. Finally, the Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are used to get an intuitive sense of the results.
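These error measures can be computed with scikit-learn as in the following short sketch; the score arrays are placeholders rather than the actual predictions.

```python
# Hedged sketch: computing MAE, MSE and RMSE for predicted alignment scores.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([5.0, 4.0, 3.0, 0.0])     # annotated alignment scores (placeholder)
y_pred = np.array([4.6, 4.2, 2.5, 0.4])     # scores predicted by the regression model

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(mae, mse, rmse)
```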
5.4.1 Finding correlations between feature values

The Pearson correlation coefficient (r) is used to measure the correlation between two features (xi, xj) and between each feature and the alignment score (xi, y). Table 14 lists the correlation values between two independent variables (xi, xj), which range from (−1, 1). Values less than or greater than zero represent negative and positive relations between two variables, respectively.

The correlation matrix reflects that the unmatched word ratio (x8) is inversely related to the similarity score (y) and shares a negative correlation of −0.62. It means that increasing the number of unmatched words in two aligned chunks will reduce the alignment score. On the other hand, the matched word ratio (x7) shares the maximum correlation value of 0.62.

The Jaccard (x3) and cosine (x4) string similarities share a strong correlation value of 0.98. Out of the three word embeddings, the GloVe (x2) and spaCy (x9) word vectors share the highest correlation of 0.77, and Word2Vec (x1) shares correlation values of 0.50 and 0.54 with x2 and x3. The WuP (x5) and path (x6) lexical word similarity measures share a correlation value of 0.96.

5.4.2 Statistical significance of feature selection on the basis of p-values

The statistical Ordinary Least Squares (OLS) regression method is adopted to find the p-value of each feature. Features having a p-value of ≤ 0.05 are considered for training and testing. A p-value (≤ 0.05) also means that the null hypothesis can be rejected for the predicted variable.13 Table 15 reports the corresponding p-value along with the standard error (std. err.), t-statistic (t) and coefficient of the feature values.

For the headlines dataset, all feature values have p-values of less than 0.05, and these are considered for training and testing. For the image caption dataset, the cosine string similarity, WuP similarity and spaCy word vectors do not contribute to predicting the correct similarity score. Similarly, for the student answer dataset, features like the GloVe word vectors and the Jaccard string similarity are discarded for training and testing.

13 https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/
Similarly, for the student answer dataset, features like GloVe
word vectors and the Jaccard string similarity are discarded
Table 10 Output of chunking module with chunks contains a noun and for training and testing.
gerund pos tag together

Chunk Type Sentence as a Chunk String 5.4.3 Results of the scoring module

Gold Standard [ Cow ] [ walking ] [ under the tree ] [ in a pasture ] The datasets discussed in Section 5.1 have been used to
Rule-based [ Cow walking ] [ under the tree ] [ in a pasture ] build the regression model. The regression algorithm has
been trained and evaluated in four different ways; these are
Classifier-based [(Cow, NN, B-NP)] listed below:
[(walking, NN, B-VP)]
[(under, IN, B-PP) (the, DT, I-PP), (tree, NN, I-PP)]
[(in, IN, B-PP) (a, DT, I-PP) (pasture, NN, I-PP)] 13 https://fivethirtyeight.com/features/not-even-scientists-can-easily-

explain-p-values/

Table 11 Output of chunking module with chunks containing a coordinating conjunction (CC) pos tag

Chunk Type Sentence as a Chunk String

Gold Standard [ A young woman ] [ with a black top and a necklace ]


Rule-based [ A young woman ] [ with a black top ] [ and a necklace ]
Classifier-based [(A, DT, B-NP) (young, JJ, I-NP) (woman, NN, I-NP)]
[(with, IN, B-PP) (a, DT, I-PP) (black, JJ, I-PP) (top, NN, I-NP)
(and, CC, I-NP) (a, DT, I-PP) (necklace, NN, I-PP)]

Table 12 F1 accuracy of the alignment module using gold standard (gs) chunks. For the student answer dataset, the 1:M aligner did not find WuP
and word2Vec similarity scores between the token of aligned chunks

Dataset Headlines Images Answer

Features Train Test Train Test Train Test

single alignment 0.8529 0.8429 0.8769 0.8952 0.8742 0.8892


single alignment + with stop words 0.8299 0.8042 0.8463 0.8322 0.8597 0.8645
single alignment + without stop words 0.9045 0.9046 0.8957 0.8988 0.8842 0.8942
single alignment + word2vec 0.9089 0.9105 0.8967 0.8934 – –
single alignment + WuP Similarity 0.9242 0.9246 0.9125 0.9018 – –
single alignment + WuP Similarity + word2vec + without stop words 0.9423 0.9523 0.9423 0.9345 – –
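
The F1 scores in Tables 12 and 13 follow from the precision and recall of the predicted token alignments against the gold alignments. The sketch below shows the standard computation; the alignments are represented as sets of token-index pairs, and the example pairs are invented.

def alignment_f1(gold, system):
    # F1 of predicted token alignments against gold alignments (sets of index pairs).
    if not gold or not system:
        return 0.0
    tp = len(gold & system)
    precision = tp / len(system)
    recall = tp / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gold = {(0, 0), (1, 1), (2, 2), (3, 3)}      # illustrative gold token alignments
system = {(0, 0), (1, 1), (2, 3)}            # illustrative system output
print(round(alignment_f1(gold, system), 4))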

Table 13 F1 accuracy of the alignment module using system (sys) chunks. For the student answer dataset, the 1:M aligner did not find WuP and word2Vec similarity scores between the tokens of aligned chunks

Dataset Headlines Images Answer

Features Train Test Train Test Train Test

single alignment 0.8422 0.8369 0.8246 0.8142 0.8014 0.8142


single alignment + with stop words 0.8122 0.7942 0.8129 0.8041 0.7942 0.7985
single alignment + without stop words 0.8862 0.8942 0.8574 0.8342 0.8233 0.8264
single alignment + word2vec 0.8895 0.8846 0.8581 0.8485 – –
single alignment + WuP Similarity 0.8962 0.8864 0.8642 0.8584 – –
single alignment + WuP Similarity + word2vec + without stop words 0.8975 0.8942 0.8742 0.8592 – –
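
The WuP, path and word2Vec similarities used as aligner features can be obtained with standard libraries. The sketch below uses NLTK's WordNet interface and a plain cosine similarity; the first-synset heuristic and the placeholder vectors are simplifying assumptions, and a real run would load pre-trained word2vec vectors (for example via gensim).

import numpy as np
from nltk.corpus import wordnet as wn   # assumes the WordNet corpus is downloaded

def wup_sim(w1, w2):
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    sim = s1[0].wup_similarity(s2[0]) if s1 and s2 else None
    return sim or 0.0

def path_sim(w1, w2):
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    sim = s1[0].path_similarity(s2[0]) if s1 and s2 else None
    return sim or 0.0

def cosine(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(wup_sim("cow", "animal"), path_sim("cow", "animal"))
print(cosine(np.array([0.2, 0.8, 0.1]), np.array([0.3, 0.7, 0.0])))  # placeholder word vectors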

Table 14 Correlation values between all feature components and the similarity score. x1–x9 represent the feature components (discussed in Section 4.3)

Features x1 x2 x3 x4 x5 x6 x7 x8 x9 Similarity Score

x1 1.00 0.50 0.54 0.51 0.49 0.52 0.53 –0.53 0.55 0.41
x2 0.50 1.00 0.62 0.64 0.47 0.51 0.65 –0.65 0.77 0.39
x3 0.54 0.67 1.00 0.98 0.58 0.65 0.85 –0.85 0.78 0.56
x4 0.51 0.64 0.98 1.00 0.56 0.63 0.85 –0.85 0.79 0.54
x5 0.49 0.47 0.58 0.56 1.00 0.96 0.61 –0.61 0.52 0.41
x6 0.52 0.50 0.65 0.63 0.96 1.00 0.70 –0.70 0.58 0.48
x7 0.53 0.65 0.85 0.85 0.61 0.70 1.00 –1.00 0.77 0.62
x8 –0.53 –0.65 –0.85 –0.85 -0.61 –0.70 –1.00 1.00 –0.77 –0.62
x9 0.55 0.77 0.78 0.79 0.52 0.58 0.77 –0.77 1.00 0.53
Similarity Score 0.40 0.39 0.56 0.54 0.41 0.48 0.62 –0.62 0.53 1.00
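
Table 14 can be reproduced with an off-the-shelf Pearson correlation over the feature matrix. The sketch below assumes a hypothetical features.csv holding the columns x1–x9 and the gold alignment score; it is not the authors' code.

import pandas as pd

df = pd.read_csv("features.csv")          # hypothetical file: columns x1..x9 and "score"
corr = df.corr(method="pearson")          # Pearson r between every pair of columns
print(corr.round(2))                      # the full matrix, as in Table 14
print(corr["score"].sort_values(ascending=False))   # e.g. x7 at 0.62, x8 at -0.62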

Table 15 Impact of each feature component for prediction of alignment score on various datasets

Headlines Image captions Student answer

Features std. err. t p>|t| std. err. t p>|t| std. err. t p>|t|

x1 0.050 2.847 0.004 0.153 4.145 0.000 0.107 4.715 0.000


x2 0.130 –4.215 0.000 0.168 –9.700 0.000 0.071 –6.365 0.296
x3 0.210 7.175 0.000 0.204 3.236 0.000 0.253 1.045 0.432
x4 0.222 –7.322 0.000 0.227 –0.759 0.448 0.256 0.786 0.000
x5 0.163 –2.973 0.003 0.169 –0.218 0.827 0.327 –6.134 0.000
x6 0.169 3.228 0.001 0.179 2.569 0.010 0.325 6.905 0.000
x7 0.068 28.542 0.000 0.067 31.126 0.000 0.066 22.124 0.000
x8 0.039 19.093 0.000 0.048 18.799 0.000 0.041 28.895 0.000
x9 0.139 6.234 0.000 0.186 0.171 0.865 0.149 1.904 0.057
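
A minimal sketch of the OLS-based screening of Section 5.4.2, using statsmodels; the file and column names are assumptions. The model summary exposes the standard error, t-statistic and p-value of every feature, as reported in Table 15, and features above the 0.05 threshold can then be dropped.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("features.csv")                          # hypothetical feature file
X = sm.add_constant(df[[f"x{i}" for i in range(1, 10)]])  # x1..x9 plus an intercept
y = df["score"]

model = sm.OLS(y, X).fit()
print(model.summary())                                    # std. err., t and p-values per feature

selected = model.pvalues.drop("const")
print(selected[selected <= 0.05].index.tolist())          # features kept for training and testing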

– Out of all the features listed in Table 15, only one feature has been considered to build the simple linear regression model. During testing, the same feature has been extracted from a pair of aligned chunks, and the similarity score has been predicted using the trained regression model.
– Next, the train–test split model is adopted: the training and test data are combined, the regression model is trained on 80% of the combined data, and the remaining 20% is used for testing.
– Next, a multiple linear regression model is built by considering all features. The full training dataset is used for training, and for validation, the complete test dataset is taken.
– At last, k-fold cross validation is considered to check the adaptability of the proposed regression model.

Further, the performance of the various regression models (listed in Table 16) is evaluated using three regression metrics. Out of the three datasets, the multivariate linear regression model achieves the minimum MAE and MSE of 0.42 and 0.41, respectively, for the headlines dataset. For the image caption and student answer datasets, the multivariate model achieved MSE values of 0.45 and 0.50, respectively.

5.4.4 Handling of the similarity and relatedness scores of 4 and 3

Further, to get the correct similarity or relatedness score of 3 or 4, a separate feature set is measured and incorporated with the scoring module to handle chunks that are closely related or very similar. These features are as follows:

– If the meanings of the chunks are opposite to each other, then a relatedness score of 4 is assigned (for example, in southern Iraq ⇔ in northern Iraq).
– If the pair of aligned chunks are prepositional phrases (PP) and the PP chunks have the same head noun with different modifiers, a score of 4 is assigned (for example, as legitimate representative ⇔ as sole representative).
– If the ratio of the number of nouns of the aligned chunks varies and the alignment score is more than 0.7, then 3 is considered as the relatedness score (e.g. Pro-Russia rebels ⇔ Syria rebels).
– If the chunk alignment score is between 0 and 0.7 and the alignments are of type PP, then a score of 3 is assigned (e.g. in Canada ⇔ near Baltimore).
– If the chunks contain any number and their values differ, then the relatedness score depends on the size of the difference, as described below.

Table 16 Results of various regression models on the headlines, images and answer datasets. For simple regression, the average error score is
reported

Model Headlines Image captions Student answer

MAE MSE RMSE MAE MSE RMSE MAE MSE RMSE

Simple 0.44 0.43 0.66 0.47 0.45 0.67 0.56 0.58 0.76
train–test split 0.45 0.50 0.71 0.47 0.48 0.69 0.52 0.51 0.71
Multivariate 0.42 0.41 0.64 0.46 0.45 0.67 0.53 0.50 0.71
k-fold 0.47 0.53 0.73 0.46 0.48 0.69 0.51 0.47 0.69
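
The four evaluation set-ups of Table 16 map directly onto standard scikit-learn utilities. The sketch below shows the train–test split and k-fold variants with MAE, MSE and RMSE; the random data stands in for the real feature matrix and gold scores, and the simple (single-feature) variant would just restrict X to one column.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error

X, y = np.random.rand(200, 9), np.random.rand(200) * 5     # placeholders for features/scores

# train-test split variant (80/20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
mae = mean_absolute_error(y_te, pred)
mse = mean_squared_error(y_te, pred)
print(mae, mse, np.sqrt(mse))                              # MAE, MSE, RMSE

# k-fold cross-validation variant
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="neg_mean_absolute_error")
print(-scores.mean())                                      # average MAE over the folds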

For chunks containing numbers: i) if the difference between the numbers is between 7 and 10, then a score of 3 is assigned (for example, 11 ⇔ 16), and ii) if the difference is less than 7, then 4 is assigned as the relatedness score (for example, 19 Firefighters ⇔ 19 Hotshots).
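
The number-difference rule above (like the other relatedness heuristics of Section 5.4.4) is a simple post-processing check on the aligned chunks. The sketch below implements only this numeric rule as stated in the text; the number extraction and everything around it are illustrative assumptions.

import re

def extract_numbers(chunk):
    return [int(tok) for tok in re.findall(r"\d+", chunk)]

def numeric_relatedness(chunk1, chunk2):
    n1, n2 = extract_numbers(chunk1), extract_numbers(chunk2)
    if not n1 or not n2:
        return None                      # rule does not apply
    diff = abs(n1[0] - n2[0])
    if diff < 7:
        return 4                         # small difference -> relatedness score 4
    if diff <= 10:
        return 3                         # difference between 7 and 10 -> score 3
    return None

print(numeric_relatedness("19 Firefighters", "19 Hotshots"))   # difference 0 -> 4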
5.5 Evaluation of the classification module

The performance of the proposed multiclass classifier is measured on the basis of the precision, recall, accuracy and F-measure scores for each classification type. For this purpose, the confusion matrix is used, which calculates the true predicted values for each class type. A confusion matrix helps to visualise the performance of a classifier on test data (for which the true class labels are known). The unaligned chunks of type 'NOALI' and the chunk pairs labelled with POL and FACT are not considered for this evaluation.

5.5.1 Evaluation of the alignment module using the confusion matrix

The performance of the classifiers on the basis of the confusion matrix is listed in Table 17. Each segment of Table 17 represents the confusion matrix of one classifier on one dataset, and the second row and column show the actual and predicted class types for the various datasets. A high percentage of aligned chunks has been labelled as the EQUI type by each classifier. Out of 1871 pairs of EQUI chunks, 1688 aligned chunks have been identified equally by SVC and KNN. On the other side, Naive Bayes only identified 1554 pairs of chunks as the EQUI type.

The Naive Bayes classifier performs better at identifying the OPPO and REL alignment types. Out of 63 OPPO and 228 REL instances, Naive Bayes has successfully recognised 51 and 91 aligned chunk pairs as OPPO and REL, respectively. For the same types, SVC and KNN perform poorly, and only 15 and 18 of the OPPO alignments have been identified successfully. For the REL type, KNN recognises 80 pairs correctly, which is 15 more than SVC.

SVC performs better at recognising the SIMI alignment type, and it identifies 263 out of 420 SIMI instances. On the contrary, Naive Bayes performs very badly and succeeds only 93 times. Compared to Naive Bayes, the KNN classifier identifies a greater number of SIMI alignments accurately, but it cannot outperform the SVM.

SVC also outperforms KNN and Naive Bayes at identifying the SPE1 alignment type. It identifies 144 (out of 324) SPE1 instances correctly, which is 22 and 83 more than KNN and Naive Bayes, respectively. However, KNN performs slightly better than the SVM for the SPE2 type: out of 337 aligned pairs, it identifies 121 correctly, which is 3 and 6 more than Naive Bayes and SVC.

5.5.2 Statistical evaluation of the alignment types

Further, precision (P), recall (R), accuracy (A) and F1-measure (F1) scores are calculated by considering the true positive (tp), false positive (fp), true negative (tn) and false negative (fn) statistical values for each class type. These statistical values are measured in the following ways:

– tp is the positive observation for each class, which means that the predicted class type is the same as the actual class type. All the diagonal values of Table 17 represent the true positive value of each class type; the tp value of the EQUI class for the Naive Bayes classifier is 475, 564 and 515 for the Headlines, Images and Answer datasets, respectively.
– fp for a class is the sum of the values in the corresponding column, excluding the tp of that class. The fp for EQUI using SVC is 180.
– fn for a class is the sum of all predicted values for which the actual class is different.
– tn for a class is the sum of all predicted values excluding the corresponding row and column.

Further, these statistical values are used to calculate the average weighted value over the true instances of each class type.

– Precision is the ratio of the tp of each predicted class over the sum of tp and fp. For classification, it tells the percentage of relevant results.
– Recall is the ratio of tp over the sum of tp and fn. It tells the percentage of total correct classifications made by the algorithm.
– Accuracy is the average of correctly predicted observations over the total observations.
– F-measure considers both fp and fn and produces a weighted average of Precision and Recall.

The results of the classifiers on the various testing datasets are listed in Table 18. The classification score of SVC with respect to precision is 0.85, 0.79 and 0.80, which is higher than that of the other classifiers. The recall and accuracy scores of SVC are 0.80 for the Headlines dataset, which is 7% and 2% more than Naive Bayes and KNN, respectively.

Out of these three datasets, Naive Bayes performs well on the Headlines dataset, where 0.77 is its highest precision score across the three datasets. For this classification task, data such as news headlines, image captions and student answers are taken into account. From the dataset statistics reported in Table 7, it is very clear that the datasets are imbalanced with respect to classification type (because more than 50% are of the EQUI aligned type).
Table 17 Details of the confusion matrices of various classifiers on the Headlines, Images and Student datasets

Headlines Images Answer

Classifier Predicted EQUI OPPO REL SIMI SPE1 SPE2 EQUI OPPO REL SIMI SPE1 SPE2 EQUI OPPO REL SIMI SPE1 SPE2

Actual
Naı̈ve Bayes EQUI 475 5 91 28 35 49 564 0 27 28 3 2 515 0 24 25 0 0
OPPO 4 2 2 4 1 0 0 1 0 0 0 0 0 48 1 0 0 0
REL 36 2 37 5 9 10 0 17 8 5 2 0 0 45 46 4 1 1
SIMI 52 4 56 22 14 10 0 31 42 63 7 42 0 41 25 8 1 2
SPE1 35 2 20 8 25 17 0 22 10 19 32 67 0 27 11 4 4 21
SPE2 40 3 15 6 10 33 1 33 19 18 9 72 0 42 12 6 6 12

SVC EQUI 501 0 1 109 47 25 624 0 0 0 0 0 563 0 0 1 0 0


OPPO 5 0 0 7 1 0 0 0 0 1 0 0 0 15 2 17 10 5
REL 38 0 1 47 10 3 0 0 0 29 2 1 0 2 64 15 4 12
SIMI 54 0 0 89 8 7 0 0 0 140 22 23 0 6 22 34 6 9
SPE1 41 0 0 26 33 7 1 0 0 56 86 7 0 2 8 4 25 28
SPE2 42 0 0 27 15 23 1 0 0 75 24 52 0 3 8 10 17 40

KNN EQUI 501 1 25 82 52 22 624 0 0 0 0 0 563 0 1 0 0 0


OPPO 5 1 0 5 1 1 0 0 0 1 0 0 0 17 8 12 9 3
REL 38 0 13 34 13 1 0 0 2 18 6 6 0 3 65 20 2 7
SIMI 54 0 18 66 12 8 0 0 18 126 24 17 0 8 31 27 5 6
SPE1 41 1 3 24 33 5 1 0 5 55 62 27 0 2 13 2 27 23
SPE2 42 1 5 23 17 19 1 0 7 60 23 61 0 5 14 5 13 41
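
The per-class tp, fp, fn and tn values described in Section 5.5.2 follow directly from each block of Table 17. The sketch below applies those definitions to the Naive Bayes / Headlines block; the numbers are copied from the table and the loop is generic.

import numpy as np

labels = ["EQUI", "OPPO", "REL", "SIMI", "SPE1", "SPE2"]
cm = np.array([                      # Naive Bayes on Headlines (rows: actual, cols: predicted)
    [475,  5, 91, 28, 35, 49],
    [  4,  2,  2,  4,  1,  0],
    [ 36,  2, 37,  5,  9, 10],
    [ 52,  4, 56, 22, 14, 10],
    [ 35,  2, 20,  8, 25, 17],
    [ 40,  3, 15,  6, 10, 33],
])

for i, label in enumerate(labels):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp         # column total minus the diagonal cell
    fn = cm[i, :].sum() - tp         # row total minus the diagonal cell
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")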

Table 18 Overall test results of the various classifiers on the Headlines, Images and Answer datasets

Headlines Images Answer

Classifier P R Accuracy F1 P R Accuracy F1 P R Accuracy F1

Naı̈ve Bayes 0.77 0.73 0.73 0.75 0.75 0.65 0.65 0.70 0.72 0.68 0.68 0.70
SVC 0.85 0.80 0.80 0.82 0.79 0.79 0.79 0.79 0.80 0.80 0.80 0.80
KNN 0.78 0.78 0.78 0.78 0.77 0.76 0.76 0.76 0.79 0.79 0.79 0.79
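
The comparison in Table 18 amounts to fitting the three scikit-learn classifiers on the chunk-pair features and reporting their weighted precision, recall and F1. A minimal sketch, with random placeholders instead of the real ten-feature vectors and alignment-type labels:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X_train, y_train = rng.random((300, 10)), rng.integers(0, 6, 300)   # placeholder data
X_test, y_test = rng.random((100, 10)), rng.integers(0, 6, 100)

for name, clf in [("Naive Bayes", GaussianNB()), ("SVC", SVC()), ("KNN", KNeighborsClassifier())]:
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), zero_division=0))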

To measure the performance of the classifier on such data, the F1-measure provides statistical significance to the result. Based on the F1-measure, it is concluded that SVC has correctly identified the greatest number of chunks, with an F1 score of 0.79 on the image caption dataset. For the student answer dataset, both KNN and SVC perform almost equally, with classification F1 scores of 0.79 and 0.80, respectively. For the news headlines dataset, SVC performs better than KNN and Naive Bayes, with a classification score of 0.82.

6 Comparison with other iSTS state-of-the-art methods

The proposed method is compared to the top-performing systems of the 2015 and 2016 SemEval iSTS tasks [6], [5]. The word alignment evaluation measures (F1 score of precision and recall of token alignments) are adopted here to compare the proposed iSTS method with other state-of-the-art methods.

Hänig et al. [23] used the OpenNLP chunker to produce system chunks. For alignment, words were categorised into different groups, and each type was aligned independently in a fixed order: named entities, temporal expressions, measurement expressions, token sequences, negation and the remaining content words. For the alignment type, various sets of features were identified, and a regression model was trained on the same set of features to assign the similarity score.

Karumuri et al. [28] used handmade rules to improve the system chunks identified with the OpenNLP chunker. Chunks were then aligned using a monolingual word aligner [49], and a supervised classification model was trained to identify the alignment type. The average score of each alignment type, calculated from the gold standard files, was used as the similarity score.

A system called NeRoSim, proposed by Banjade et al., participated in the gold standard chunks category and used a rule-based alignment method to align the chunks [10]. The system also identified various semantic relations, such as synonyms, antonyms and hypernyms of a word, with a lookup mechanism during chunk alignment. Several ranges of similarity score between a pair of chunks, computed with Mikolov word vectors, were used to assign the similarity score [39].

Banjade et al. [9] performed the chunking using a Conditional Random Field (CRF)-based chunker to produce system chunks. Additionally, three rules were used to combine system chunks to satisfy the requirements of the gold standard chunks. Thereafter, the system-level chunks were compared to the gold standard files, and accuracies of 86.20% and 63.34% were reported at the sentence level and chunk level, respectively. The NeRoSim model was adopted to align chunks and to assign a relation type with a score.

A deep learning-based classification model was developed by Magnolini et al. to predict the chunk-to-chunk alignment, alignment type and similarity score [35]. A set of three classifiers with a multitask MLP was used to develop the classification model. In total, 245 features were extracted from a pair of chunks to train the classifiers. A distributional word representation with 100 dimensions for each chunk was used to calculate the similarity score of a pair of aligned chunks.

UWB, reported in [31], divided the iSTS task into three classification and regression tasks. A binary classification algorithm was developed for alignment, and the alignment score was measured using both classification and regression algorithms. Finally, another classification algorithm was used for type classification.

Further, the NeRoSim method [10] was extended by Kazmi et al. (reported as Inspire [29]), and an iSTS method based on logic programming was developed. The chunking was based on a joint approach of a POS tagger and a dependency parser with answer set programming (ASP) to identify the chunk boundaries. For alignment, an ASP solver was used with features such as word2Vec, POS tags and the alignment rules adopted by NeRoSim.

IISCNLP [50] developed the iMATCH algorithm, in which integer linear programming was used to combine non-contiguous chunks. For the alignment type and score, a multiclass classifier with random forest was used.

Lopez-Gazpio et al. [34] developed an iSTS method and reported two case studies as applications of the iSTS method. In comparison to this, we have used the modules of the proposed method for a textual entailment recognition system.

Lopez-Gazpio et al. [34] did not develop a chunker to produce the chunks of the sentences, and their results are reported for gold standard chunks only. For alignment, they used a ready-made monolingual word aligner; in comparison, we use a token-to-chunk multi aligner, developed by the authors of the proposed work and reported in [37]. For alignment scoring, they used a rule-based method, whereas we have developed a supervised regression model to predict the alignment score.

Majumder et al. [36, 37], the authors of the proposed method, developed two earlier methods and tried to compete with the top-performing systems of SemEval 2015 [6] and 2016 [5]. First, a baseline method was developed and evaluated with the headlines dataset only. In that first work, the main goal was to develop a rule-based chunker and an aligner. For the aligner, various lexical and semantic features with cosine similarity between Mikolov word vectors were used. Further, for classification, a rule-based method was developed, and the average cosine similarity score between the tokens of aligned chunks was used as the similarity score.

In the second method [37], the fundamental goal was to improve the accuracy of the aligner, and for that purpose, a token-to-chunk 1:M aligner was developed and tested over the headlines and image caption datasets in the system and gold chunk categories.

In the proposed method, we have tried to improve the results of chunking and of assigning the alignment type and score. To improve the chunking accuracy, a Maximum Entropy classifier is used and tested over the headlines, images and student-answer datasets; sentence-level chunking accuracies of 82.56%, 89.85% and 89.75% are achieved for the respective datasets. Further, we have represented the proposed iSTS problem as a classification and regression task. For identifying an alignment type, a supervised classification algorithm is developed using the Naive Bayes, kNN and support vector machine classifiers. Finally, a multivariate regression algorithm is developed to calculate the alignment score.

For evaluation, a ready-made Perl evaluation script, which is available with the gold standard data, is used here to produce four types of F1-accuracy evaluation metrics, as follows:

– Independently, the F1 score of the alignment module is measured.
– The F1 score of chunk alignment with type is measured, and pairs with different alignment types are ignored.
– The F1 score of chunk alignment with score is measured.
– The F1 score of chunk alignment with type and score is measured.

The proposed method outperforms all the state-of-the-art methods in all modules of an iSTS method. The comparative results for the news headlines, image captions and student answers are listed in Tables 19, 20 and 21, respectively. Out of the five state-of-the-art methods reported for headlines and image captions, the first three participated in the 2015 pilot iSTS task [6], in which chunk alignments were restricted to 1:1 alignments. Next, five methods participated in the 2016 iSTS task, with an extension to one-to-multi (1:M) chunk alignments.

In the iSTS pilot task, NeRoSim obtained the first rank (in the gold standard scenario) for the headlines dataset, with F1 accuracies of 0.8984, 0.6666, 0.8263 and 0.6426 for alignment, type, score and type with score. The basic system reported by Majumder et al. [36] achieved alignment scores of 0.8500 and 0.7900 in the gold standard and system chunks categories.

In [37], a token-to-multi-chunk aligner was proposed by Majumder et al. and tested on the headlines and image caption datasets; improved alignment scores of 0.95 and 0.93 were reported. Subsequently, this token-to-multi-chunk alignment method also reported higher accuracies for alignment type, score and type with score.

7 Application of interpretable STS

In order to judge the usefulness of the chunking and alignment modules of the proposed iSTS method, we have developed a Textual Entailment (TE) method using features extracted at the chunk and sentence level. Thereafter, the alignment score between a pair of aligned chunks is also considered as a feature for the entailment task. In NLP, recognising textual entailment is the task of determining, for two text expressions, whether the meaning of one text (the entailing text T) can be inferred from the other text (the hypothesis H) or not. In this section, we use the chunking and alignment modules of the proposed iSTS method to extract features from T–H pairs. To implement a TE method based on iSTS, there is no need to implement all of the iSTS modules; instead, the chunking and alignment modules can be used as feature extractors. Implementing a TE method using the chunk and alignment features can provide better accuracy for the judgement task, where a system infers whether the meanings of the T–H pair are the same or not.

7.1 About the textual entailment dataset

The dataset of sentence pairs (T–H) was collected by human annotators and used in the PASCAL RTE1 textual entailment recognising task.14 The human annotators selected examples of both positive entailment (as TRUE)

14 https://tac.nist.gov/data/RTE/index.html

Table 19 F1–accuracy of alignment, type, score and type with score over the headlines test dataset

Year Method Chunk Type F1 + Alignment F1 + Type F1 + Score F1 Type + Score

2015 ExB Themis [23] gs 0.8146 0.4943 0.7171 0.4885


sys 0.7032 0.4331 0.6224 0.4290
UMDuluth- Blue Team [28] gs 0.7820 0.5154 0.7024 0.5098
sys 0.8861 0.5962 0.7960 0.5887
NeRoSim [10] gs 0.8984 0.6666 0.8263 0.6426
DTSim [9] gs 0.9070 0.6730 0.8320 0.6630
sys 0.8380 0.5610 0.7600 0.5470
FBK-HLT- NLP [35] gs 0.8994 0.7031 0.7865 0.6960
sys 0.8366 0.5605 0.7595 0.5467
2016 UWB [31] gs 0.8900 0.6150 0.8150 0.6080
Inspire [29] gs 0.8190 0.7030 0.7870 0.6960
sys 0.7040 0.5260 0.6590 0.5200
IISCNLP [50] gs 0.9130 0.5760 0.8290 0.5560
sys 0.8210 0.5080 0.7400 0.4290
2017 Lopez et al. [34] gs 0.8990 0.6400 0.8210 0.6190
2018 lexical features + cosine similarity [36] gs 0.8500 0.6200 0.8100 0.6500
sys 0.7900 0.5900 0.7700 0.6500
2016 token to chunk multi aligner [37] gs 0.9500 0.7100 0.8500 0.7000
– Proposed Method gs 0.9523 0.7700 0.9250 0.7346
sys 0.8975 0.6500 0.8760 0.5842

and negative examples (as FALSE), where entailment does not hold. Some examples of T–H pairs are listed in Table 22. The dataset is further divided into train and test sets: the train and test datasets comprise 576 and 800 T–H pairs, respectively, and both have equal numbers of TRUE and FALSE entailments.

Table 20 F1–accuracy of alignment, type, score and type with score over the image caption test dataset

Year Method Chunk Type F1 + Ali F1 + Type F1 + Score F1 Type + Score

2015 ExB Themis [23] gs 0.8057 0.4413 0.7007 0.4296


sys 0.6966 0.3970 0.6068 0.3870
UMDuluth- Blue Team [28] gs 0.8336 0.5759 0.7511 0.5634
sys 0.8853 0.6095 0.7968 0.5964
NeRoSim [10] gs 0.8870 0.6143 0.7877 0.5841
2016 DTSim [9] gs 0.8770 0.6680 0.8160 0.6480
sys 0.8430 0.6280 0.7810 0.6100
FBK-HLT- NLP [35] gs 0.8922 0.6867 0.8404 0.6708
sys 0.8429 0.6276 0.7813 0.6095
UWB [31] gs 0.8940 0.6880 0.8410 0.6710
Inspire [29] gs 0.8670 0.6140 0.7950 0.6130
sys 0.8170 0.5430 0.7420 0.5360
IISCNLP [50] gs 0.8930 0.5250 0.8230 0.5090
sys 0.8460 0.4990 0.7770 0.4870
2017 Lopez et al. [34] gs 0.8850 0.6560 0.8090 0.6160
2019 token to chunk multi aligner gs 0.9300 0.7200 0.8800 0.6800
– Proposed Method gs 0.9345 0.7800 0.9380 0.7356
sys 0.8592 0.6700 0.8920 0.5845

Table 21 F1–accuracy of alignment, type, score and type with score over the student answer test dataset

Year Method Chunk Type F1 + Ali F1 + Type F1 + Score F1 Type + Score

2016 DTSim [9] gs 0.8610 0.5550 0.7810 0.5460


sys 0.8180 0.5160 0.7370 0.5070
FBK-HLT- NLP [35] gs 0.8684 0.6511 0.8245 0.6385
sys 0.8166 0.5613 0.7574 0.5547
UWB [31] gs 0.8590 0.6170 0.8040 0.6110
Inspire [29] gs 0.8210 0.5130 0.7440 0.5100
sys 0.7620 0.4550 0.6700 0.4520
IISCNLP [50] gs 0.8930 0.5250 0.8230 0.5090
sys 0.8460 0.4990 0.7770 0.4870
Proposed Method gs 0.8942 0.7900 0.9420 0.6942
sys 0.8464 0.6800 0.9050 0.6445

7.2 An iSTS-based Textual Entailment System

In this section, we present an iSTS-based textual entailment recognition system. The overall process of the TE system is shown in Fig. 6. The complete system is divided into training and testing phases, and each phase consists of four modules: preprocessing, chunking, alignment and feature extraction. The training phase includes a supervised binary classification algorithm that trains a model to produce the textual entailment recognition decision as TRUE or FALSE. In this method, we use the Naive Bayes classification algorithm, with the parameters described in Section 4.4.

After preprocessing, sentences are segmented using the classifier-based chunker discussed in Section 4.1.2. Thereafter, the chunks of T and H are aligned to each other; the alignment module discussed in Section 4.2 is used for this purpose. The preprocessing and feature extraction modules are discussed in Sections 7.2.1 and 7.2.2, respectively.

7.2.1 Preprocessing

The input sentences contain special characters, such as commas and dashes, and other non-valuable information, which are preprocessed. At first, the preprocessing module removes them, and all currency symbols (e.g., $) are replaced with the corresponding text using a pre-defined list. Secondly, named entities are recognised and the T–H sentence pairs are converted to lowercase; the spaCy NLP module is used for this purpose. Finally, all numbers, such as 5-billion, are converted to text (five billion).

7.2.2 Feature extraction

Sentence-level and chunk-alignment-level features are extracted from the sentences to train the entailment model. The details of the chunk level (CL) features are discussed in Section 4.3. In addition to this feature set, a chunk alignment score is also calculated. The chunk alignment score is the fraction of aligned chunk pairs over the total number of chunks in T and H, as shown in (11).

    alignscore = (2 × number of aligned chunk pairs) / (total number of chunks in T and H)    (11)

Secondly, before extracting the sentence-level (SL) features, stop words are removed from the T–H pairs, and the following list of features is extracted:

– At first, the word overlap score between T and H is calculated.
– Thereafter, the named-entity overlap score between the text and the hypothesis is also measured.

Table 22 Examples of the T -H sentence pairs of RTE-1 dataset

Text Hypothesis Entailment

Oracle had fought to keep the forms from being released Oracle released a confidential document FALSE
Phish disbands after a final concert in Vermont on Aug. 15 Rock band Phish holds final concert in Vermont TRUE
Satomi Mitarai died of blood loss. Satomi Mitarai bled to death. TRUE
Jennifer Hawkins is the 21-year-old beauty queen from Australia. Jennifer Hawkins is Australia’s 20-year-old beauty queen. FALSE
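
The chunk-alignment-score feature of (11) and the sentence-level word-overlap feature are both simple ratios. The sketch below computes them for one of the Table 22 pairs; the chunking of the two sentences and the number of aligned pairs are illustrative assumptions (in practice they come from the chunking and 1:M alignment modules).

def align_score(num_aligned_pairs, chunks_t, chunks_h):
    # equation (11): fraction of aligned chunk pairs over all chunks of T and H
    return 2 * num_aligned_pairs / (len(chunks_t) + len(chunks_h))

def word_overlap(text, hypothesis):
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    return len(t & h) / len(t | h) if (t | h) else 0.0

chunks_t = ["Phish disbands", "after a final concert", "in Vermont", "on Aug. 15"]   # illustrative chunking
chunks_h = ["Rock band Phish", "holds final concert", "in Vermont"]                  # illustrative chunking
print(align_score(2, chunks_t, chunks_h))      # assuming two chunk pairs were aligned
print(word_overlap("Satomi Mitarai died of blood loss", "Satomi Mitarai bled to death"))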

Fig. 6 The working flow of the iSTS-based TE recognition system. In the training phase, known sentence pairs <T, H> pass through preprocessing, chunking (<CT, CH>), alignment (<CT <=> CH>) and feature extraction (sentence-level + alignment-level features), and a supervised binary classification algorithm is trained on the classification labels; in the testing phase, the same pipeline feeds the trained model, which predicts <TRUE, FALSE>.

– Finally, a ratio between the number of missing words and the named entities of T and H is also measured.

7.3 Results

The sentence pairs of the RTE-1 dataset are equally balanced between positive (TRUE) and negative (FALSE) entailment. Two types of evaluation measures were used to judge the performance of the systems submitted to the PASCAL RTE1 challenge task [16].

– The raw precision score, which is measured by comparing the system-generated judgements with the gold standard; the percentage of matching judgements provides the accuracy score.
– The Confidence-Weighted Score (CWS, also known as average precision) was also measured. This score checks the confidence of the judgements of the submitted systems and varies between 0 (wrong judgement) and 1 (correct classification). Before calculating the confidence score, the judgements of the test dataset are sorted in descending order of confidence, and the required formula is shown in (12).

    cws = (1/n) * sum over i = 1..n of (#correct-up-to-rank-i / i)    (12)

where n represents the number of pairs in the test dataset and i ranges over the sorted pairs.

Results of the iSTS-based textual entailment recognition system are reported in Table 23, in which the Confidence-Weighted Score is represented as a fraction of 100. We have trained and tested the textual entailment system four times, considering various feature sets. For the TE system, we have extracted two types of features: chunk level (CL) and sentence level (SL). Models trained with CL features alone report the lowest accuracies, 52.54% and 51.25% on the development and test datasets, respectively. When the chunk-level features are combined with the alignment features (as discussed in Section 4.2), the results improve significantly, to 64.56% and 60.34%, respectively.

8 Conclusion

The proposed work presents an interpretable semantic textual similarity method and formalises an interpretability layer with chunking, alignment and an alignment score. We have formulated the interpretability layer as a collective approach of four modules. At first, a pair of sentences is divided into segments called chunks, and these chunks are then aligned to each other. Each alignment is labelled with an alignment type and an alignment score. The alignment type represents a relation between a pair of chunks, such as opposition, equivalence, similarity, relatedness and specificity.

We have developed a supervised multiclass classification algorithm to assign an alignment type. To build this classifier, a set of ten features is extracted from a pair of aligned chunks. Additionally, we have developed a multiple linear regression algorithm to assign an alignment score to the alignment. Finally, we have incorporated the output of the individual modules and produced an interpretability layer with the similarities and differences between two small text snippets.

Table 23 Performance of the proposed iSTS-based Textual Entailment Recognition System on the RTE1 development and test dataset

Features RTE1 Dev Set RTE1 Test Set

Acc CWS Acc CWS

Chunk Level (CL) 52.54% 54.65% 51.25% 57.62%


CL + alignment score 64.56% 65.23% 60.34% 63.45%
CL + Sentence Level (SL) 57.78% 60.23% 59.23% 62.56%
CL + SL + alignment score 65.45% 67.23% 60.25% 63.67%
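
The Confidence-Weighted Score of (12) averages the running precision over the test pairs after sorting them by the system's confidence. A small self-contained sketch of that computation (the 0/1 judgement list is a toy example):

def confidence_weighted_score(judgements_sorted_by_confidence):
    # judgements: 1 for a correct system judgement, 0 for a wrong one,
    # already sorted from most to least confident (equation (12)).
    total, correct_so_far = 0.0, 0
    for i, is_correct in enumerate(judgements_sorted_by_confidence, start=1):
        correct_so_far += int(is_correct)
        total += correct_so_far / i
    n = len(judgements_sorted_by_confidence)
    return total / n if n else 0.0

print(confidence_weighted_score([1, 1, 0, 1, 0]))   # toy example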

The presented work has been evaluated on three datasets of distinct domains and compared to other state-of-the-art interpretable semantic similarity methods. The accuracy of the proposed method shows that it outperforms all of the state-of-the-art methods. Further, to judge the usefulness of the iSTS method, we have formalised a textual entailment method based on the features extracted at the chunk and sentence level of the entailing (T) and hypothesis (H) texts. In addition, the alignment score between the chunks of T and H is also considered as a feature. A Naive Bayes binary classification algorithm is used with the same set of parameters used for the identification of alignment types in the iSTS method. The PASCAL RTE-1 textual entailment dataset is used for the experiments, and the results show that the performance of the entailment system improves when the alignment information is combined with the chunk and sentence level features.

Acknowledgments The work presented here falls under the Research Project Grant No. IFC/4130/DST-NRS/2018-19/IT25 (DST-CNRS targeted program). The first author is also thankful to the Google Colab product for providing support for the experiments and also wants to acknowledge the Introduction to Machine Learning course of the SWAYAM NPTEL program of Govt. of India.

References

1. Abney SP (1992) Parsing by chunks. In: Berwick RC, Abney SP, Tenny C (eds) Principle-based parsing: computation and psycholinguistics, vol 44. Springer, Dordrecht, pp 257–278
2. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, Guo W, Mihalcea R, Rigau G, Wiebe J (2014) SemEval-2014 task 10: Multilingual semantic textual similarity. In: Proceedings of the 8th international workshop on semantic evaluation, SemEval '14. Association for Computational Linguistics, Dublin, Ireland, pp 81–91
3. Agirre E, Cer D, Diab M, Gonzalez-Agirre A, Guo W (2013) *SEM 2013 shared task: Semantic textual similarity. In: Second joint conference on lexical and computational semantics (*SEM), Volume 1: Proceedings of the main conference and the shared task. Association for Computational Linguistics, Georgia, pp 32–43
4. Agirre E, Diab M, Cer D, Gonzalez-Agirre A (2012) SemEval-2012 task 6: A pilot on semantic textual similarity. In: Proceedings of the first joint conference on lexical and computational semantics and the sixth international workshop on semantic evaluation, SemEval '12. Association for Computational Linguistics, Stroudsburg, pp 385–393. http://dl.acm.org/citation.cfm?id=2387636.2387697
5. Agirre E, Gonzalez-Agirre A, Lopez-Gazpio I, Maritxalar M, Rigau G, Uria L (2016) SemEval-2016 task 2: Interpretable semantic textual similarity. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, pp 512–524. http://www.aclweb.org/anthology/S16-1082
6. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, Guo W, Lopez-Gazpio I, Maritxalar M, Mihalcea R, Rigau G, Uria L, Wiebe J (2015) SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In: Proceedings of the 9th international workshop on semantic evaluation, SemEval '15. Association for Computational Linguistics, Denver, Colorado, pp 252–263. https://doi.org/10.18653/v1/s15-2045
7. Aleven V, Popescu O, Koedinger KR (2001) Pedagogical content knowledge in a tutorial dialogue system to support self-explanation. Working notes of the AIED 2001 workshop on tutorial dialogue systems
8. Baker RS, Corbett AT, Koedinger KR (2004) Detecting student misuse of intelligent tutoring systems. In: Lester JC, Vicari RM, Paraguaçu F (eds) Intelligent tutoring systems. Springer, Berlin, pp 531–540
9. Banjade R, Maharjan N, Niraula NB, Rus V (2016) DTSim at SemEval-2016 task 2: Interpreting similarity of texts based on automated chunking, chunk alignment and semantic relation prediction. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016). Association for Computational Linguistics, pp 809–813. https://doi.org/10.18653/v1/S16-1125
10. Banjade R, Niraula NB, Maharjan N, Rus V, Stefanescu D, Lintean M, Gautam D (2015) NeRoSim: A system for measuring and interpreting semantic textual similarity. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015). ACL, Denver, pp 164–171. http://www.aclweb.org/anthology/S15-2030
11. Brockett C (2007) Aligning the RTE 2006 corpus. Microsoft Research Technical Report MSR-TR-2007-77
12. Chambers N, Cer D, Grenager T, Hall D, Kiddon C, MacCartney B, de Marneffe MC, Ramage D, Yeh E, Manning C (2007) Learning alignments and leveraging natural logic. In: Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing. Association for Computational Linguistics, Prague, pp 165–170. https://www.aclweb.org/anthology/W07-1427
13. Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27. https://doi.org/10.1145/1961189.1961199
14. Clive B, van der Goot E, Blackler K, Garcia T, Horby D (2005) Europe Media Monitor – system description. EUR Report 22173-En
15. Coelho TAS, Calado PP, Souza LV, Ribeiro-Neto B, Muntz R (2004) Image retrieval using multiple evidence ranking. IEEE Transactions on Knowledge and Data Engineering 16:408–417. https://doi.org/10.1109/TKDE.2004.1269666
16. Dagan I, Glickman O, Magnini B (2006) The PASCAL recognising textual entailment challenge. In: Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising textual entailment. Springer, Berlin, pp 177–190. https://doi.org/10.1007/11736790_9
17. Dzikovska MO, Bental D, Moore JD, Steinhauser NB, Campbell GE, Farrow E, Callaway CB (2010) Intelligent tutoring with natural language support in the Beetle II system. In: Sustaining TEL: from innovation to learning and practice. Springer, Berlin, pp 620–625. https://doi.org/10.1007/978-3-642-16020-2_64
18. Fellbaum C, Miller G (1998) WordNet: an electronic lexical database. Combining local context and WordNet similarity for word sense identification. MIT Press, Cambridge, pp 265–283
19. Fyshe A, Wehbe L, Talukdar P, Murphy B, Mitchell T (2015) A compositional and interpretable semantic space. In: Proceedings of the 2015 conference of the North American chapter of the Association for Computational Linguistics: human language technologies. Association for Computational Linguistics, pp 32–41. http://www.aclweb.org/anthology/N15-1004
20. Guthrie D, Allison B, Liu W, Guthrie L, Wilks Y (2006) A closer look at skip-gram modelling. In: LREC '06.

European Language Resources Association (ELRA), pp 1222– 10th international workshop on semantic evaluation, SemEval ’16.
1225 ACM, San Diego, pp 783–789
21. Hand DJ, Yu K (2001) Idiot’s bayes: Not So Stupid after All? 36. Majumder G, Pakray P, Avendaño DEP (2018) Interpretable
International Statistical Review / Revue Internationale De Statis- semantic textual similarity using lexical and cosine similarity.
tique 69(3):385–98. http://doi.org/10.2307/1403452. Accessed In: Social transformation – digital way. Springer, Singapore,
January 12, 2020 pp 717–732. https://doi.org/10.1007/978-981-13-1343-1 59
22. Harris ZS (1954) Distributional Structure, WORD, 10:2-3 37. Majumder G, Pakray P, Pinto D (2019) Measuring interpretable
pp 146–162. https://doi.org/10.1080/00437956.1954.11659520 semantic similarity of sentences using a multi chunk aligner. J
23. Hänig C, Remus R, de la Puente X (2015) ExB themis: Exten- Intell Fuzzy Syst 36(5):4797–4808. https://doi.org/10.3233/JIFS-
sive feature extraction from word alignments for semantic textual 179028
similarity. In: Proceedings of the 9th international workshop on 38. Mihalcea R, Corley C, Strapparava C (2006) Corpus–based
semantic evaluation (SemEval 2015). Association for Computa- and knowledge–based measures of text semantic similarity.
tional Linguistics, https://doi.org/10.18653/v1/s15-2046 In: Proceedings of the 21st national conference on artificial
24. Islam A, Inkpen D (2008) Semantic text similarity using corpus- intelligence - Volume 1. AAAI Press, Boston, pp 775–780
based word similarity and string similarity. ACM Trans Knowl 39. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013)
Discov Data (TKDD) 2(2):10 Distributed representations of words and phrases and their
25. Jackendoff R (1983) Semantics and cognition (Current studies in compositionality. In: Proceedings of the 26th international
linguistics series). MIT Press, Cambridge conference on neural information processing systems - Volume 2,
26. Jaynes ET (1957) Information theory and statistical mechan- NIPS’13. Curran Associates Inc., USA, pp 3111–3119. http://dl.
ics. Phys Rev 106(4):620–630. https://doi.org/10.1103/PhysRev. acm.org/citation.cfm?id=2999792.2999959
106.620 40. Nielsen RD, Ward W, Martin JH (2009) Recognizing entailment
27. Jordan PW, Makatchev M, Pappuswamy U, VanLehn K, Albacete in intelligent tutoring systems*. In: Natural language engineering,
PL (2006) A natural language tutorial dialogue system for physics. vol 15. Cambridge University Press, New York, pp 479–501.
In: Proceedings of the nineteenth international florida artificial https://doi.org/10.1017/S135132490999012X
intelligence research society conference, pp 521–526. http://www. 41. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion
aaai.org/Library/FLAIRS/2006/flairs06-102.php B, Grisel O, Blondel M, Müller A, Nothman J, Louppe G,
28. Karumuri S, Vuggumudi VKR, Chitirala SCR (2015) Umduluth– Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A,
blueteam : Svcsts – a multilingual and chunk level semantic Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit–
similarity system (semeval 2015). In: Proceedings of the 9th learn: Machine learning in python. J Mach Learn Res, 2825–
International workshop on semantic evaluation. Association for 2830
Computational Linguistics, pp 107–110. http://doi.org/10.18653/ 42. Pennington J, Socher R, Manning C (2014) Glove: Global vectors
v1/S15-2019. http://www.aclweb.org/anthology/S15-2019 for word representation. In: Proceedings of the 2014 conference
29. Kazmi M, Schüller P (2016) Inspire at SemEval-2016 task on empirical methods in natural language processing (EMNLP).
2: Interpretable semantic textual similarity alignment based on Association for Computational Linguistics, Doha, pp 1532–
answer set programming. In: Proceedings of the 10th international 1543. https://doi.org/10.3115/v1/D14-1162. https://www.aclweb.
workshop on semantic evaluation (SemEval-2016). Association org/anthology/D14-1162
for Computational Linguistics, San Diego, pp 1109–1115, 43. Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010)
https://doi.org/10.18653/v1/s16-1171 Collecting image annotations using amazon’s mechanical turk.
30. Koeling R (2000) Chunking with maximum entropy models. In: In: Proceedings of the NAACL HLT 2010 workshop on cre-
Proceedings of the 2nd workshop on Learning language in logic ating speech and language data with Amazon’s mechanical
and the 4th conference on Computational natural language learn- turk, CSLDAMT ’10. Association for Computational Linguis-
ing -, CoNLL’2000. Association for Computational Linguistics, tics, Stroudsburg, pp 139–147. http://dl.acm.org/citation.cfm?
pp 139–141. https://doi.org/10.3115/1117601.1117634 id=1866696.1866717
31. Konopik M, Prazak O, Steinberger D, Brychcı́n T (2016) UWB 44. Ritter A, Mausam, Etzioni O (2010) A latent dirichlet allocation
at SemEval-2016 task 2: Interpretable semantic textual sim- method for selectional preferences. In: Proceedings of the 48th
ilarity with distributional semantics for chunks. In: Proceed- annual meeting of the association for computational linguistics,
ings of the 10th international workshop on semantic evalua- ACL’10. Association for Computational Linguistics, Stroudsburg,
tion (SemEval-2016). Association for Computational Linguistics, pp 424–434
https://doi.org/10.18653/v1/s16-1124 45. Salton G, Singhal A, Mitra M, Buckley C (1997) Automatic text
32. Lesk M (1986) Automatic sense disambiguation using machine structuring and summarization. Inform Process Manag 33(2):193–
readable dictionaries: How to tell a pine cone from an ice cream 207
cone. In: Proceedings of the 5th annual international conference on 46. Sang EFTK, Buchholz S (2000) Introduction to the conll-
systems documentation, SIGDOC ’86. ACM, New York, pp 24– 2000 shared task: chunking. In: Proceedings of the 2nd
26. https://doi.org/http://doi.acm.org/10.1145/318723.318728 workshop on learning language in logic and the 4th conference
33. Li Y, McLean D, Bandar Z, O’Shea J, Crockett K (2006) Sentence on computational natural language learning-Volume 7, ConLL
similarity based on semantic nets and corpus statistics. IEEE ’00. Association for Computational Linguistics, pp 127–132.
Trans. Knowl. Data Eng. 18(8):1138–1150. https://doi.org10. https://doi.org/10.3115/1117601.1117631
1109/tkde.2006.130 47. Schütze H (1998) Automatic word sense discrimination. Comput
34. Lopez-Gazpio I, Maritxalar M, Gonzalez-Agirre A, Rigau G, Linguist 24(1):97–123
Uria L, Agirre E (2017) Interpretable semantic textual similarity: 48. Self J (1990) Theoretical foundations for intelligent tutoring
Finding and explaining differences between sentences. Knowl- systems. J Artif Intell Ed 1(4):3–14. http://dl.acm.org/citation.
Based Syst 119:186–199. https://doi.org/10.1016/j.knosys.2016. cfm?id=95885.95888
12.013 49. Sultan MA, Bethard S, Sumner T (2014) Back to basics for
35. Magnolini S, Feltracco A, Magnini B (2016) Fbk-hlt-nlp at monolingual alignment: Exploiting word similarity and contextual
semeval-2016 task 2 : A multitask, deep learning approach for evidence. Trans Assoc Comput Linguist 2:219–230. http://dl.acm.
interpretable semantic textual similarity. In: Proceedings of the org/citation.cfm?id=1614025.1614037

50. Tekumalla L, Jat S (2016) IISCNLP at SemEval-2016 task 2: Dr. Partha Pakray is an Assis-
Interpretable STS with ILP based multiple chunk aligner. In: Pro- tant Professor in the Depart-
ceedings of the 10th international workshop on semantic evalua- ment of Computer Science &
tion (SemEval-2016). Association for Computational Linguistics, Engineering at National Insti-
San Diego, pp 790–795, https://doi.org/10.18653/v1/s16-1122 tute of Technology, Silchar,
51. Thadani K, McKeown K (2011) Optimal and syntactically– India since August, 2018. He
informed decoding for monolingual phrase-based alignment. In: worked as an Assistant Pro-
Proceedings of the 49th annual meeting of the association for com- fessor at National Institute of
putational linguistics: human language technologies: short papers- Technology, Mizoram, India
volume 2. Association for Computational Linguistics, Portland, from 2015 to 2018. Besides,
pp 254–259. https://www.aclweb.org/anthology/P11-2044 he was a researcher at CIC-
52. Šarić F, Glavaš G, Karan M, Šnajder J, Bašić BD (2012) Takelab: IPN, Mexico, funded by DST-
systems for measuring semantic text similarity. In: Proceedings of CONACYT (Govt. of India)
the first joint conference on lexical and computational semantics - and guest researcher at Uni-
volume 1: Proceedings of the main conference and the shared task, versity of Bremen, Germany,
and volume 2: Proceedings of the sixth international workshop on funded by DST-DAAD (Govt.
semantic evaluation, SemEval ’12. Association for Computational of India). He received his B.E. degree in Computer Science and
Linguistics, Stroudsburg, pp 441–448 Engineering from Jalpaiguri Govt. Engineering College, West Ben-
53. Wegrzyn-Wolska K, Szczepaniak PS (2005) Classification of gal, India in 2004 while he received M.E. (CSE) and Ph.D. (Engg.)
RSS-formatted documents using full text similarity measures. In: degree from Jadavpur University, West Bengal, India in 2007 and 2013
Lecture notes in computer science. Springer, Berlin, pp 400–405. respectively. He received dual post-doctoral fellow, one from Norwe-
https://doi.org/10.1007/11531371 52 gian University of Science & Technology (NTNU), Norway, funded
54. Wu Z, Palmer M (1994) Verb semantics and lexical selection. by ERCIM (European Union) and another from Masaryk University
In: Proceedings of the 32nd annual meeting on association for (MU), Czech Republic, funded by ERCIM & MU (European Union).
computational linguistics, ACL’94, pp 133–138 His area of research interest and specialization include Natural Lan-
55. Yao X, Durme BV, Callison-Burch C, Clark P (2013) A guage Processing (Machine Translation, Semantic Textual Similarity,
lightweight and high performance monolingual word aligner. In: Text Entailment, Question Answering, Information Retrieval, Image
Proceedings of the 51st annual meeting of the association for Captioning, Text Summarization), Machine Learning, Deep Learning,
computational linguistics (volume 2: Short Papers). Association and Machine Intelligence. He has received 01 National Project funded
for Computational Linguistics, Sofia, pp 702–707. https://www. by SERB, DST (Govt. of India) and 03 international project funded
aclweb.org/anthology/P13-2123 by DST-DAAD (Germany), ASEAN-DST (Malaysia and Indonesia),
DST-CEFIPRA (France). He has published more than 100 papers in
Publisher’s note Springer Nature remains neutral with regard to research articles in reputed journals, conferences, workshops, sympo-
jurisdictional claims in published maps and institutional affiliations. siums. He is very active in several international conferences organized
by local and international associations, institutions, or universities.

Mr. Goutam Majumder com-


peted his Bachelor and Mas- Dr. Ranjita Das is currently
ters in Computer Science & serving as a Head &Assistant
Engineering from NIT Agar- Professor at the department of
tala and Tripura University (A Computer Science and Engi-
Central University) in the year neering, National Institute of
2008 and 2013 respectively. Technology Mizoram. She has
After graduation, he joined C- joined National Institute of
DAC Pune to complete his PG Technology Mizoram in the
Diploma in Advance Comput- year 2011 as an Assistant Pro-
ing. He served as a Research fessor. Her research was in
Associate in the Biometrics the area of Pattern recognition,
Laboratory of Tripura Uni- Information retrieval, Compu-
versity (A Central University) tational biology and Machine
from 2009 - 2013. From the learning. She has published
year 2014 to 2018, he served 15 journal and International
as an Assistant Professor in the Computer Science & Engineer- conference papers. Under her
ing department of NIT Mizoram and at present, he is pursuing his supervision presently, six research scholars are doing research work.
PhD degree from NIT Mizoram and serving as an Assistant Profes-
sor in Lovely Professional University. He has published many Book
Chapters, International and National Level Conference and Journal
Paper. His area of interest is Artificial Intelligence, Natural Language
Processing and Computer Vision.

Prof. David Pinto received


his PhD in computer science
in the area of artificial intel-
ligence and pattern recogni-
tion at the Polytechnic Uni-
versity of Valencia, Spain in
2008. At present he is a full
time professor at the Fac-
ulty of Computer Science of
the Benemerita Universidad
Autonoma de Puebla (BUAP)
where he leads the PhD pro-
gram on Language & Knowl-
edge Engineering. His areas
of interest include clustering,
information retrieval, cross-
lingual NLP tasks and computational linguistics in general. He has
published more than 100 research publications in NLP and artificial
intelligence.
