
Evaluation of Lexical-Based Approaches to the Semantic Similarity of Malay Sentences

Article in Journal of Quantitative Linguistics · April 2015
DOI: 10.1080/09296174.2014.1001637



Title: Evaluation of Lexical-Based Approaches to the Semantic Similarity of
Malay Sentences
(the final version of this paper appeared in JOURNAL OF QUANTITATIVE
LINGUISTICS 22(2) · APRIL 2015)

Authors:
Shahrul Azman Noah, Nazlia Omar, and Amru Yusrin Amruddin
Knowledge Technology Research Group
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia

Corresponding author:
Shahrul Azman Noah
Faculty of Information Science & Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia
e-mail: shahrul@ukm.edu.my; samn.ukm@gmail.com
tel.: +6-03-89216343, +6-013-3306626
fax: +6-03-89256732
Evaluation of Lexical-Based Approaches to the Semantic Similarity of Malay Sentences

Abstract. We evaluate existing and modified approaches for measuring the

semantic similarity of sentences in the Malay language. These approaches are

mainly used for English sentences and no studies to date have evaluated and

compared their effectiveness when applied to Malay sentences. We used a

preprocessed Malay machine-readable dictionary to calculate word-to-word

semantic similarity with two methods: probability of intersection and

normalization. We then used the word-to-word semantic similarity measure to

identify semantic sentence similarity. We evaluated five measures of semantic

sentence similarity: vector-based semantic similarity, word order similarity, highest word-to-sentence similarity, and combinations of vector-based similarity with word order similarity and of highest word-to-sentence similarity with word order similarity. We also

evaluated the effects of including and excluding lexical components such as

prepositions, conjunctions, verbs, and morphological variants.

Keywords: semantic similarity; Malay language; natural language processing

1 Introduction

The concept of similarity is very important in many applications related to information

retrieval and natural language processing (Egozi, Markovitch, & Gabrilovich 2011; Zhang,

Gentile & Ciravegna. 2012). Similarity measures are functions that calculate numeric values

associating (pairs of) objects. For example, a vector-space document retrieval system measures

the distance between a query and documents in the system. Here, a shorter distance indicates

greater similarity. The use of external sources such as domain corpora and lexical databases to enrich similarity measures with semantic information is a common approach (Liu, Yu

& Meng, 2004). Most of the previous work on measuring similarity (Lemaire & Denhiere 2006;

Bollegala, Matsuo & Ishizuka, 2011), however, has been based purely on word and phrase

similarity, thus ignoring the semantics that such words have in the context of the surrounding

sentence or even document. Semantic sentence similarity, in contrast, is a measure whereby a

set of sentences or terms within term lists are assigned a metric based on the likeness of their

meaning content (Cilibrasi & Vitanyi, 2006). The emphasis on word-to-word similarity metrics

is undoubtedly due to the availability of resources that explicitly specify the relations among

words, such as WordNet (ZoongCheng, 2009; Liu & Wang 2013). Although sentence similarity

is assumed to be more complex than word similarity, it performs better than measures based on word similarity in applications such as paragraph retrieval (Verberne, 2007; Verberne,

Boves, Oostdijk & Coppen, 2008), conversational agents (O'Shea, 2012), expert systems (Lee,

2011), and text summarization (Aliguliyev, 2009).

A number of researchers have investigated semantic sentence similarity for English

(Metzler, Bernstein, Croft, Moffat & Zobel, 2005; Castillo and Cardenas, 2010; Li, McLean,

Bandar, O’Shea, & Crockett, 2006) and have reported encouraging results. However, there

have been no investigations to date that compared approaches to sentence similarity for the

Malay language. The Malay language is an Austronesian language spoken by the Malay people

and people of other races who reside in the Malay Peninsula, southern Thailand, the

Philippines, Singapore, central eastern Sumatra, the Riau islands, and parts of the coast of

Borneo. It is an official language of Malaysia, Brunei, Singapore, Indonesia, and East Timor. Therefore, there are many print and digital documents written in the Malay language

and there is a great need for systems and algorithms to process such documents. In this paper,

we describe an experiment in measuring the semantic similarity of Malay sentences. Our

experiment is based on a number of dimensions such as word stemming, word-order vectors,

word semantic vectors, and word-to-word similarity. Section 2 provides the background

knowledge for our experiment and an overview of related research. Section 3 describes our

method for measuring the semantic similarity of Malay sentences and the evaluative

experiment that we conducted. Section 4 reports the results of our experiment, and Section 5

presents our conclusion and directions for future work.

2 Background and Related Work

As previously mentioned, the common approaches to measuring semantic similarity are

mainly at the word or concept level, with many fewer at the sentential level. A similarity

function for sentence similarity will, given two sentences (or segments), generate a score that

indicates their relatedness. Most sentence similarity measures, however, are mainly concerned

with “calculating” the presence or absence of words in the compared sentences, and popular

methods include word overlap measures, term frequency–inverse document frequency (TF-

IDF) measures, relative frequency measures, and probabilistic models. Semantic sentence

similarity measures, in contrast, extend these conventional approaches by calculating a score

for a pair of sentences that indicates their similarity at the semantic level. Sentence similarity

has been reported to be useful in applications such as question-answering systems (Qiu, Bu, Chen, Huang & Cai, 2007), text categorization (Ko, Park & Seo, 2004), and paraphrase recognition (Mihalcea, Corley & Strapparava, 2006). Mihalcea et al. (2006), for example,

reported that their experiment shows that a semantic sentence similarity measure outperforms

simpler vector-based similarity on paraphrase recognition tasks.

In a semantic sentence similarity measure, the first task is to obtain word-to-word

semantic measures for the compared sentences, and then scoring functions can be used to

generate the similarity value between the sentences. A relatively large number of word-to-word

similarity measures have previously been proposed in the literature. According to Mihalcea et

al. (2006), these fall into two groups: corpus-based measures and knowledge-based measures.

Corpus-based measures of semantic word similarity seek to identify the similarity between

words using information derived from large corpora (Turney, 2001; Karov & Edelman, 1998).

Turney (2001) proposed the “pointwise mutual information measure,” which was based on

term co-occurrence counts over large corpora. Another popular approach is latent semantic

analysis (LSA), whereby term co-occurrences are captured by means of dimensionality

reduction using singular value decomposition (SVD). Knowledge-based measures, on the other

hand, identify semantic similarity between words by using information from a dictionary or a

thesaurus to calculate degrees of relatedness among words. For example, Leacock and

Chodorow’s (1998) method counts the number of nodes on the shortest path between two

concepts in WordNet. Resnik (1995) and Li, McLean, Bandar, O’Shea, and Crockett (2006)

also use WordNet to calculate semantic measures. Lesk’s (1986) method defines semantic

similarity between two words based on overlap measures between the corresponding dictionary

definitions.

Experiments on semantic sentence similarity for English have shown promising results.

Mihalcea et al. (2006) showed that incorporating semantic information into measures of sentence similarity significantly improved paraphrase recognition compared with vector-based cosine similarity. They experimented with both corpus-based and knowledge-based approaches. In the corpus-based approach, the degree of similarity between words was derived from large corpora, whereas in the knowledge-based approach it was derived from WordNet using several measures, such as those of Leacock and Chodorow and of Lesk. Li et al. (2006) proposed a semantic sentence similarity measure using WordNet and corpus statistics. Their similarity measure is based on semantic and word order information; a detailed explanation of this method is given in the next section of this paper. Their work focused on short sentences such as those featured in applications like conversational agents and dialogue systems. Results from their experiments showed that the proposed method provides similarity measures that are fairly consistent with human judgements.

As mentioned earlier, sentence similarity measures benefit many applications. One of the earliest applications of text similarity is probably the vector space model of information retrieval, in which the relevance of documents to a given user query is determined by ranking algorithms that measure the similarity between the query vector and the document vectors (Salton & Lesk, 1971). Since then, text similarity has gained research interest in various applications such as relevance feedback and text classification (Rocchio, 1971) and word sense disambiguation (Lesk, 1986; Schutze, 1998).

Recently, with advances in the information retrieval field and the availability of large textual corpora and knowledge sources, semantic sentence similarity has received attention in several applications. In question-answering systems, semantic similarity measures between sentences play an important role in finding similar questions in archives of users' requests. Qiu et al. (2007), for example, showed how syntactic information embedded in similarity measures could outperform several baseline retrieval models. In paraphrase recognition, Mihalcea et al. (2006) experimented with various semantic similarity measures and showed that they outperform other vector-based models.

There is presently no research that compares semantic sentence similarity measures for

Malay sentences. There is some general research on Malay document retrieval. Ahmad, Yusoff,

and Sembok (1996) and Othman (1993), for instance, proposed algorithms for the stemming

of Malay words, Abdullah, Ahmad, Mahmod and Sembok (2003) applied the latent semantic

index approach to Malay-English cross-language document retrieval, Kong and Yusoff (1995)

worked towards English-Malay machine translation, and recently Noor, Noah, Aziz and Hamzah (2012) investigated methods for anaphora detection in Malay text. But so far, no

research has focused directly on the semantic similarity of Malay sentences.

3 Semantic Similarity Measures for Malay Sentences

A semantic sentence similarity measure compares a pair of sentences S1 and S2 and

automatically generates a value that indicates their similarity. The comparison of S1 and S2 is

usually done by means of word-to-word similarity measures among the constituent words in S1

and S2. Therefore, assuming that S1 and S2 can be represented as finite vectors of words {w1,

w2, w3,…,wm} and {v1, v2, v3,…,vn}, respectively, a number of possible scoring functions

proposed in the literature can be applied. The simplest would be to consider all the possible

similarities among the constituent words, as indicated in the following equation:

sim(S1, S2) = Σi=1..m Σj=1..n sim(wi, vj)    (1)
However, this would be impractical, as it requires processing every word pair, with complexity O(n²). In this

experiment, we therefore considered several approaches to measuring sentence similarity

mainly inspired by the work of Mihalcea et al. (2006) and Li et al. (2006). However, before

describing the experiment, we first discuss the approaches to measuring the semantic similarity

between words.
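As a rough illustration (not the authors' code), the exhaustive scoring of equation (1) can be sketched in Python; `word_sim` is a placeholder for any word-to-word measure such as those discussed in Section 3.1:

```python
# Naive scoring from equation (1): sum the word-to-word similarity
# over every cross-sentence word pair, i.e. m*n comparisons for
# sentences of lengths m and n.
def pairwise_similarity(s1, s2, word_sim):
    return sum(word_sim(w, v) for w in s1 for v in s2)
```

With an exact-match `word_sim`, this reduces to counting shared word occurrences across the two sentences.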

3.1 Malay Word-to-Word Semantic Similarity

As previously mentioned, semantic similarity measures between words can be grouped

into corpus-based measures and knowledge-based measures. We chose to focus on knowledge-

based measures, as there currently exists no large corpus for Malay language sources.

Furthermore, as a linguistic database for the Malay language similar to WordNet is not yet

available, we chose to use an existing lexical dictionary. The lexical dictionary contains 69,344

rows of data with 48,177 Malay words, based on the 4th edition of the Kamus Dewan (“Dewan

Dictionary”; an established Malay dictionary published by Dewan Bahasa dan Pustaka).

However, the dictionary is not yet available in a machine readable dictionary (MRD) format—

i.e., the dictionary is available only in a human-readable format—so some preprocessing was

required. The dictionary was parsed by filtering and eliminating symbols, short-form words,

verbs, and other words not found in the dictionary.

Our investigation of several methods for knowledge-based measures determined that

the only suitable method for our purpose was Lesk’s (1986) method. This is due to the nature

of the generated MRD dictionary, which only contains meanings of words and not the

hierarchical structure of words that models human common-sense knowledge about general

language usage such as is found in WordNet (Miller, 1995).

Using Lesk’s method, the similarity sim(w1, w2) of words w1 and w2 can be calculated

using either the probability of intersection or normalization. The probability of intersection

uses the following equation:

simprob(w1, w2) = P(Mw1 | C) · P(Mw2 | C)    (2)

where M denotes the meaning of the subscripted word, C is the set of unique overlapping words found in the meanings of w1 and w2, and P(Mw1 | C) refers to the probability of the meaning of word w1 containing an instance of C. The normalization method, on the other hand, is based on the following equation:

simnorm(w1, w2) = [P(Mw1 | C) + P(Mw2 | C)] / 2    (3)

As can be seen from equation (3), the normalization method is very similar to the probabilistic

method, except that the probabilities for the meanings of word w1 and w2 are normalized.

The following illustrates the calculation of both word-to-word semantic similarity methods. Assume that we want to find the similarity between the words sekolah (school) and madrasah (religious school). Referring to the MRD, there are ten unique overlapping words between the meanings of sekolah and madrasah (i.e., |C| = 10), and the total numbers of unique words in the meanings of sekolah and madrasah are 15 and 11, respectively. Therefore P(Msekolah|C) = 0.667 and P(Mmadrasah|C) = 0.909. By using equations (2) and (3) respectively, we obtain simprob(sekolah, madrasah) = 0.606 and simnorm(sekolah, madrasah) = 0.788.
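A minimal Python sketch of this Lesk-style overlap similarity (equations (2) and (3)); the definition lists below are synthetic stand-ins sized to reproduce the sekolah/madrasah example, not entries from the actual MRD:

```python
def lesk_similarity(definitions, w1, w2, method="prob"):
    # Word-to-word similarity from dictionary-definition overlap.
    m1, m2 = set(definitions[w1]), set(definitions[w2])
    c = m1 & m2                    # unique overlapping words C
    p1 = len(c) / len(m1)          # P(M_w1 | C)
    p2 = len(c) / len(m2)          # P(M_w2 | C)
    if method == "prob":
        return p1 * p2             # equation (2)
    return (p1 + p2) / 2           # equation (3)

# Synthetic definitions: 15 and 11 unique words with an overlap of 10,
# mirroring the sekolah/madrasah example in the text.
overlap = ["c%d" % i for i in range(10)]
defs = {
    "sekolah": overlap + ["s1", "s2", "s3", "s4", "s5"],
    "madrasah": overlap + ["m1"],
}
```

Run on these synthetic definitions, the sketch reproduces simprob(sekolah, madrasah) ≈ 0.606 and simnorm(sekolah, madrasah) ≈ 0.788.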

3.2 Semantic Similarity Measures between Sentences

The derived word-to-word semantic similarity values discussed above are used to

define the semantic similarity values between two sentences. A comparison of the semantic

similarity between two sentences can be implemented in one of two ways: either by comparing each

word in sentence S1 with all the words in sentence S2 and generating the similarity values based

on these word-to-word similarities or by constructing a joint distinct word set for the two

sentences. Assuming that we are comparing sentences S1 and S2, a set of distinct joint words S

is formed from S1 and S2 as follows:

S = S1 ∪ S2, where S1 = {w1, w2, w3, …, wm} and S2 = {v1, v2, v3, …, vn}

For example, assuming that we have the sentences S1: Saya berjalan ke sekolah (I walked to school) and S2: Dia berkereta ke bandar (He drove to town), then we will have S = {saya, berjalan, ke, sekolah, dia, berkereta, bandar}. The joint word set S is used to derive the various

semantic measures.

In this experiment, we consider the following measures of semantic sentence similarity: word order similarity, highest word-to-sentence similarity, semantic vector similarity, and combinations of word order similarity with the semantic vector and word-to-sentence measures. These measures require a measure

of word-to-word similarity as previously described. In order to establish a suitable value ξ for

a threshold for deciding whether compared words or terms are semantically similar, an

experiment involving 200 pairs of synonyms was conducted. The synonyms were derived

based on Moidin (2008). A pair of terms ti and tj are considered similar if sim(ti, tj) > ξ, while

similarity values less than ξ are considered to indicate that the words are not semantically

similar. For the word-to-word similarity methods considered in this study, i.e., the probabilistic

and normalization methods, ξ = 0.18 and ξ = 0.37 were selected as the respective threshold

values. The threshold values were derived empirically: word pairs with similarity values below 0.18 for the probability method or below 0.37 for the normalization method were intuitively judged to be dissimilar.
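The threshold decision can be sketched as follows (a trivial illustration, not the authors' code, using the empirically derived values reported above):

```python
# Empirically derived thresholds xi from the 200-synonym experiment:
# 0.18 for the probability method, 0.37 for the normalization method.
THRESHOLDS = {"prob": 0.18, "norm": 0.37}

def semantically_similar(sim_value, method="prob"):
    # Terms are considered similar only if sim(ti, tj) > xi.
    return sim_value > THRESHOLDS[method]
```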

3.2.1 Similarity Based on Semantic Vector

The similarity method based on semantic vectors (Simv) uses the joint word set S as a

basis to derive semantic information about the compared sentences S1 and S2. The joint word

set S is viewed as providing the semantic information for the compared sentences. There is an

open question about whether to consider morphological variants. Li et al. (2006) do not

consider morphological variants. In this case, the Malay words for makan (eat), makanan
(food), and pemakanan (nutrition) are considered to be three unique words and can all appear

in the joint set S. However, Noah, Amruddin, and Omar (2007) argue that morphological

variations among words play a significant role in deriving sentence similarity values as shown

in their simple experiment. We consider both cases in this experiment and will discuss in the

Results section the effects these cases have on determining the similarity values.

To derive the semantic information content of S1 and S2, term-term matrices for the two

sentences are constructed as follows:

S  =  w1    w2    ...   wn

          q1  | x1,1  x1,2  ...  x1,n |
          q2  | x2,1  x2,2  ...  x2,n |
Si =      ..  |  ..    ..   ...   ..  |
          qm  | xm,1  xm,2  ...  xm,n |

where xi,j represents the similarity measure between the ith word qi in the compared sentence

and the jth word wj of the joint word set S. The value of xi,j = 1 if qi and wj are the same word,

whereas if qi ≠ wj, the similarity measure is computed using the previously described word-to-

word semantic similarity method.

The raw semantic vector š for Si (i = 1,2) can then be computed with š = {max(x1,1, …,

xm,1), …, max(x1,n, …, xm,n)}. For example, if S1 = {negara, Malaysia, aman, sentosa} and S2 = {negara, jepun, maju}, then we have S = {negara, Malaysia, aman, sentosa, jepun, maju}. Comparing the joint set S with S1 and S2, we obtain the following term-term matrices, respectively:

| 1    0      0    0    0    0 |
| 0    1      0    0    0    0 |
| 0    0      1    0    0    0 |
| 0    0      0    1    0    0 |

| 0    0      0    0    1    0 |
| 1    0.667  0    0    0    0 |
| 0    0      0    0    0    1 |

and therefore, the raw semantic vector š for S1 and S2 will be {1, 1, 1, 1, 0, 0} and {1, 0.667, 0,

0, 0, 0} respectively.

For the calculation of the semantic vector si, the following formula is then used:

si = ši · I(wi) · I(w̄i)    (4)

where wi is a word in the joint word set S and w̄i is its associated word in the sentence. The

value of I(w), which is the weight of word w, is calculated with reference to the MRD

dictionary, using the following formula:

I(w) = 1 − log(n + 1) / log(N + 1)    (5)

where n is the number of rows of meaning containing the word w and N is the total number of rows (of meaning) in the dictionary. The use of I(wi) and I(w̄i) allows the two constituent words to contribute to the similarity based on their individual information contents (Li et al., 2006).

By using equations (4) and (5), we obtain the following semantic vectors s1 and s2:

s1 = {0.204, 0.286, 0.464, 0.767, 0, 0}

s2 = {0.204, 0.161, 0, 0, 0.342, 0.408}

Finally, the semantic similarity between the two compared sentences is simply the cosine

coefficient between the two semantic vectors:

Simv(S1, S2) = cos(s1, s2) = (s1 · s2) / (‖s1‖ × ‖s2‖)    (6)

Therefore, the Simv between S1 and S2 is 0.154.
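The raw-vector construction and the cosine of equation (6) can be sketched as follows (illustrative Python, with `word_sim` a placeholder for the word-to-word measure of Section 3.1):

```python
import math

def raw_semantic_vector(sentence, joint, word_sim):
    # For each word of the joint set S, keep the highest similarity
    # against any word of the sentence (1.0 on an exact match).
    return [max(1.0 if q == w else word_sim(q, w) for q in sentence)
            for w in joint]

def cosine(a, b):
    # Cosine coefficient between two semantic vectors, equation (6).
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms
```

Applied to the weighted vectors s1 and s2 above, `cosine` reproduces Simv(S1, S2) ≈ 0.154.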

3.2.2 Word Order Similarity Measure between Sentences

The measure for word order similarity (Simo) is a straightforward process based on the

distinct joint word set S. Assuming that we have the following pair of sentences S1 and S2:

S1: Negara Malaysia aman sentosa

S2: Jepun negara maju

we will have the joint word set S = {negara, Malaysia, aman, sentosa, Jepun, maju}. Similarly

to semantic similarity, the word order vector is derived from the joint set S. A term-term matrix

is constructed and the word-to-word similarity measure is calculated using the method

discussed in section 3.2.1. The resulting matrix for the sentence S1 and the joint set S is similar

to the one presented in section 3.2.1, but for readability we present it again as follows:

S1 \ S     negara  Malaysia  aman  sentosa  Jepun  maju
negara        1        0       0       0       0     0
Malaysia      0        1       0       0       0     0
aman          0        0       1       0       0     0
sentosa       0        0       0       1       0     0

u1 = (1 2 3 4 0 0)

The word order vector u1 for S1 is constructed based on the joint existence or the highest word-

to-word similarity between the joint set S and S1. Therefore we have u1 = (1 2 3 4 0 2); the last

value of u1 is equal to 2 because the word maju in S is strongly similar to the word Malaysia,

which is in the second position in S1. Similarly, we have u2 = (2 2 3 3 1 3) derived from the

following matrix:

S2 \ S     negara  Malaysia  aman  sentosa  Jepun  maju
Jepun         0        0       0       0       1     0
negara        1      0.667     0       0       0     0
maju          0        0       0       0       0     1

u2 = (2 2 0 0 1 3)

Using the word order similarity defined as follows:

Simo(S1, S2) = 1 − ‖u1 − u2‖ / ‖u1 + u2‖    (7)

we have Simo(S1, S2) = 0.828. The word order similarity in (7) is determined by the normalized difference of the word order vectors.
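Equation (7) can be sketched as follows (an illustrative Python implementation, not the authors' code):

```python
import math

def word_order_similarity(u1, u2):
    # Equation (7): one minus the Euclidean norm of the vector
    # difference divided by the norm of the vector sum.
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u1, u2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(u1, u2)))
    return 1.0 - diff / total
```

With the word order vectors u1 = (1, 2, 3, 4, 0, 2) and u2 = (2, 2, 3, 3, 1, 3) from the example, this gives ≈ 0.828 as reported.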

3.2.3 Semantic Similarity Measures based on Highest Word-to-Sentence Similarity

The highest word-to-sentence similarity approach (Sims) requires comparing each word

in each sentence with all the words in the other sentence. The similarity between the sentences

S1 and S2 is based upon the maximum word-to-word similarity between each word w in S1 and

the words in S2 and vice versa. The similarity measure is therefore calculated using the

following equation:

Sims(S1, S2) = (1/2) × [ (Σw∈S1 maxSim(w, S2) · idf(w)) / (Σw∈S1 idf(w))
                       + (Σw∈S2 maxSim(w, S1) · idf(w)) / (Σw∈S2 idf(w)) ]    (8)

where maxSim(w, S2) is the highest word-to-word similarity between the word w in S1 and the words in S2, and maxSim(w, S1) likewise finds the most similar word in S1 for each word w in S2. idf(w) measures the specificity of word w using the classic inverse document frequency (idf) introduced by Sparck-Jones (1972), represented as follows, where N is the total number of items in the collection and dfw is the number of items in the collection that contain the word w:

idf(w) = N / dfw    (9)

Mihalcea et al. (2006) used this approach for evaluating text semantic similarity based upon various word-to-word semantic similarity measures. Therefore, for the sentences S1 and S2 in the previous example, Sims(S1, S2) = 0.292.
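Equation (8) can be sketched as follows; `word_sim` and `idf` are placeholders for the word-to-word measure and equation (9), since the collection statistics behind the 0.292 figure are not given in the text:

```python
def highest_word_to_sentence_similarity(s1, s2, word_sim, idf):
    # Equation (8): idf-weighted average of each word's best match
    # in the other sentence, symmetrised over both directions.
    def directed(a, b):
        num = sum(max(1.0 if w == v else word_sim(w, v) for v in b) * idf(w)
                  for w in a)
        den = sum(idf(w) for w in a)
        return num / den
    return 0.5 * (directed(s1, s2) + directed(s2, s1))
```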

3.2.4 Combined Semantic Sentence Similarity Measures

Based upon the previous semantic similarity measures, we can derive combined semantic

measures. In this experiment we consider two combinations:

Simv+o = δSimv + (1 – δ)Simo (10)

and

Sims+o = δSims + (1 – δ)Simo (11)

where δ is a damping factor that decides the contribution of the individual similarity measures

used. Li et al. (2006) suggested that δ should be greater than 0.5 due to the importance of lexical

elements presented in the semantic similarity of Simv and Sims (Wiemer-Hastings, 2000).
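Equations (10) and (11) share the same form, sketched below; the default δ here is only an example value consistent with the δ > 0.5 suggestion, not a value prescribed by the paper:

```python
def combined_similarity(sim_semantic, sim_order, delta=0.85):
    # Equations (10)/(11): delta weights the semantic component
    # (Simv or Sims); Li et al. (2006) suggest delta > 0.5.
    return delta * sim_semantic + (1.0 - delta) * sim_order
```

For instance, with Simv = 0.154 and Simo = 0.828 from the worked example and δ = 0.5 (the setting used in the results section), Simv+o ≈ 0.491.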

As a summary, the aforementioned measures for semantic sentence similarity can be

represented in Figure 1. As can be seen, a set of unique words (S) is first generated from the

input compared sentences. The set of unique words is then compared with the input sentences

to generate raw semantic vector and word order vector for the Simv and Simo approaches

respectively. The raw semantic vector is then transformed into semantic similarity whereas the

word order vector is transformed into order similarity as discussed in the previous sections.

The pair of sentences is also directly processed using equation (8) to generate the highest word-

to-sentence similarity (Sims). The semantic similarity is combined with the order similarity to

produce Simv+o, whereas the same order similarity is combined with the highest word-to-sentence similarity to produce Sims+o.

<Figure 1>

4 Malay Language Grammatical Structure

The Malay language has been the national language of Malaysia since 1955 and is

formally known as Bahasa Malaysia (Chang, 1980). The basic Malay sentence consists of two

major constituents: subject and predicate. Similar to English, the subject and predicate can be

derived from noun phrases, verb phrases, or adjective phrases. According to Karim (1995),

there are four basic patterns in Malay as summarized in Table 1:

<Table 1>

There are four types of sentences in Malay: declarative, interrogative, imperative, and

exclamatory (Karim, 1995). A declarative sentence usually declares something about the

subject, e.g., “Ahmad pegawai di firma itu” (Ahmad is an officer in that firm). An interrogative

sentence usually poses a question such as “Siapa dia?” (Who is he?). An imperative sentence

makes a request or gives an order like “Keluar dari sini” (Get out of here). An exclamatory

sentence expresses emotion such as joy, surprise, or anger, e.g., “Wah…Besarnya rumah

ini!”(Wow…This house is huge!).

There are four categories of words in Malay: nouns, verbs, adjectives and function

words. Function words in Malay are words in various positions of a sentence that provide a

specific syntactic function, such as conjoining, modifying, emphasizing, negating, and

indicating specificity (Karim, 1995). Examples of function words are dan, jikalau, setelah, and

hanya.

As mentioned earlier, a Malay online dictionary, Kamus Dewan (DBP 2005), was used

as the lexical resource in this study. This dictionary consists of 48,177 Malay words with

69,334 definitions. Necessary steps were taken to filter out the function words, as these words

are not significant in measuring semantic similarities. Other types of words that were filtered

out include abbreviations such as “dll” and “sbg” and symbols such as “~” and “@.” However,

the symbol “-” is considered important due to its role in identifying reduplicated words such

as “kanak-kanak” and “layang-layang.” Reduplication of nouns generally gives a semantic

category of heterogeneity or an indefinite plural, while reduplication of verbals results in

semantic features that signify repetition, continuity, habituality, intensity, extensiveness, and

resemblance (Chang, 1980). Affixation also exists in Malay, whereby a base form is extended

by one or more affixes. In Malay, the affixes can be classified as prefixes, suffixes, infixes, and

circumfixes. Morphemes in the form of root words, stems, and affixes are taken into

consideration, as these may provide additional semantic features. In short, semantic similarity

features can be influenced by including or excluding function words, symbols, abbreviations,

and morphemes.

5 Results and Discussion

Testing was conducted for 200 pairs of common Malay sentences. These sentence pairs were

first rated by humans with a value between 0.0 and 1.0, where 0.0 indicates that the sentences

in the pair are not related at all and 1.0 indicates that the sentences are exactly the same or

similar. We decided to select a threshold value of 0.5 to indicate whether the pair of sentences

are semantically similar. The human-rated similarities (sometimes called the “gold standard”)

were then compared with the values derived from the similarity measures described in Section

3. Before proceeding to the analysis of results, we first provide a small walk-through example

of how the results compare for the various similarity measures. Some examples of testing

results are illustrated in Table 2.

Table 2 separates the results into vector-based semantic similarities (Simv), order similarities

(Simo), highest word-to-sentence similarities (Sims), and the combination of similarities Simv

and Simo (Simv+o) and of similarities Sims and Simo (Sims+o), with δ = 0.5. The testing illustrated

in Table 2 compares the first sentence of the list with the remaining six sentences. To facilitate

the discussion, we refer to the first sentence of the list as the “target sentence” and the remaining

six sentences as the “compared sentences.” The “human ranking of similarity” is the rank given by humans to each compared sentence in terms of its similarity to the target sentence.

The results in Table 2 show a consistent outcome between the human similarity ranking and

the automated sentence similarity measures, with very minimal differences.

<Table 2>

Table 3 shows selected results from the initial testing. The intention is to provide the
initial and general outcome of each approach as compared to the human similarity judgements.

<Table 3>

As can be seen in Table 3, the sentences in pair 1 were correctly identified as similar

by all approaches except Simv. Simv relies solely on the semantic vectors derived from the joint word set, and it seems that this alone is not enough to produce useful similarity values. The

sentences in pairs 2 and 3 were correctly identified as respectively semantically similar and not

similar by all approaches. However, from the values for pairs 4 and 5, we can see the effect of

the connective terms “kerana” (because) and “kalau” (if). The sentences in pair 4, which should be identified as similar, were decisively identified as not semantically similar by all approaches. Changing the word “kerana” to “kalau” in the second sentence of pair 4 to produce pair 5, however, seems to have a significant effect for the Simo and Sims+o approaches. In the

case of pair 6, the Sims approach wrongly classified the pair as semantically similar. This might

be due to the number of similar words in the two sentences, which were nevertheless semantically different due to the presence of the distinct nouns “kedai” (shop) and “sekolah” (school). The
< 19 >
sentences in pair 7 were wrongly classified as semantically similar by the Simo approach, as

the approach focuses on the word order similarity. In turn, the Simv+o and Sims+o values were

influenced by the high Simo similarity values.

Previous work in this area did not consider morphological variants among words.

However, our further observations found that morphological variants do have an impact on

sentence similarity. To illustrate this, consider the following compared sentences and their

similarity measures Simv+o. The underlined words are morphological variants in Malay. In the

first case, “kahwin” (married) is the root word for “berkahwin” (got married), and in the second

case, “baca” (read) is the root word for “membaca” (reading).

S1 = Saya suka lelaki bujang itu. (I like that bachelor man.)

S2 = Saya suka lelaki belum berkahwin itu. (I like that unmarried man.)

S3 = Saya suka lelaki belum kahwin itu. (I like that unmarried man.)

Simv+o (S1, S2) = 0.58; Simv+o (S1, S3) = 0.90

S4 = Saya suka mengaji buku. (I like to recite the book.)

S5 = Saya suka membaca buku. (I like reading the book.)

S6 = Saya suka baca buku. (I like to read the book.)

Simv+o (S4, S5) = 0.66; Simv+o (S4, S6) = 0.85

As we can see, words that are stemmed to their root words give higher similarity measures.

Morphological variants should therefore be taken into account, although handling them

adds processing overhead for automated systems.
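The effect of stemming on such pairs can be illustrated with a deliberately naive prefix stripper. This is our own sketch, not the stemmer used in the experiments: a real Malay stemmer must handle many more affixes and spelling rules (e.g., restoring the root-initial consonant so that “mengaji” yields “kaji”, not “gaji”).

```python
# Illustrative only: strips the few prefixes seen in the examples above.
PREFIXES = ("ber", "mem", "men", "me")

def strip_prefix(word):
    """Return a crude root form by removing the first matching prefix,
    keeping at least three characters of the candidate root."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            return word[len(p):]
    return word

# After stripping, "berkahwin" matches "kahwin" and "membaca" matches
# "baca", so word-overlap measures score such pairs higher.
```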

Based on these “walk-through” observations, we designed our testing so that it

considers the aforementioned issues and elements, i.e., morphological variants of terms and the

effects of connective words (conjunctions), prepositions, and verbs. Table 4 shows the results

of the testing compared with the human judgements. It reports the percentage of accurate

identifications for each approach, i.e., its ability to correctly identify the similarity of all pairs

(usually referred to as the “recall value”). Concerning the word-to-word similarity methods, the

results clearly show that the probabilistic intersection provides better outcomes, as evidenced

by Experiments 6–10, in which the percentage of accurate identifications increased for all

approaches, by between 5% (the Simo approach) and 10.84% (the Sims approach).
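The accuracy percentage can be read as the share of sentence pairs on which an approach agrees with the human judges about whether the pair is similar. A sketch under the assumption of a fixed decision threshold (the 0.5 value is illustrative, not taken from our experiments):

```python
def accuracy_pct(machine_scores, human_scores, threshold=0.5):
    """Percentage of pairs on which the machine classification
    (score >= threshold means "similar") matches the human one."""
    agree = sum(
        (m >= threshold) == (h >= threshold)
        for m, h in zip(machine_scores, human_scores)
    )
    return 100.0 * agree / len(machine_scores)
```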

<Table 4>

We summarize the analysis of our results according to the following dimensions:

i. Stemmed or morphological variants: A comparison of the results of Experiments 1
and 2 and of Experiments 6 and 7 indicates that stemming improves accuracy for the
majority of approaches, particularly for Simv and Simv+o.
ii. Conjunctions or connective words: The removal of conjunctions has little impact on
the accuracy of the similarity approaches, as evidenced by a comparison of
Experiments 2 and 3 and of Experiments 7 and 8.
iii. Prepositions: The removal of prepositions seems not to have any impact when using
the normalized approach to word-to-word similarity (Experiments 3 and 4).
However, a slight improvement was obtained when using the probabilistic approach
(Experiments 8 and 9).
iv. Verbs: Verbs play important roles in expressing the meanings of sentences. The
results show that the accuracy decreases for the majority of the approaches under the
probabilistic method when we remove such information but is maintained across
most normalization approaches. This is evidenced by a comparison of Experiments
4 and 5 and of Experiments 9 and 10.
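The experimental conditions above correspond to different preprocessing pipelines applied before the similarity computation. A sketch of how the variants can be composed; the word lists are tiny illustrative samples, not the full lists used in the experiments:

```python
# Sample word classes (illustrative subsets only).
CONJUNCTIONS = {"dan", "kerana", "kalau", "sambil"}
PREPOSITIONS = {"ke", "di", "dari"}

def preprocess(tokens, stem=None, drop=()):
    """Drop any configured word classes, then apply an optional stemmer.
    drop is a tuple of sets, e.g. (CONJUNCTIONS, PREPOSITIONS)."""
    banned = set().union(*drop) if drop else set()
    kept = [t for t in tokens if t not in banned]
    return [stem(t) for t in kept] if stem else kept

# Without conjunctions and prepositions; an Experiment 4-style
# condition would additionally pass a stemmer via stem=.
tokens = "saya pergi ke sekolah".split()
result = preprocess(tokens, drop=(CONJUNCTIONS, PREPOSITIONS))
```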

7 Conclusions and Future Work

While research in this area has been dominated by studies of the English language, little

work has focused on the Malay language. In this paper, we presented the results of our

evaluation of five lexical-based approaches to semantic similarity measures for Malay

sentences. These approaches compare pairs of sentences by first finding the similarity measures

between words. The two proposed word-to-word semantic measures are based on probabilistic

intersections and normalization. Our experiment shows consistent and encouraging results that

indicate the promising potential of applying these approaches to the Malay language.

In our experiment, the Malay MRD lexical database proved to be useful for measuring

word-to-word similarity in the absence of a structured knowledge base similar to WordNet.

The normalization and probabilistic similarity measures achieved a maximum of 59% and 67%

accuracy, respectively, which suggests that the probabilistic method is superior. In addition,

our evaluation shows that identification of morphological variants improves the accuracy of

the semantic similarity measure for Malay sentences, while prepositions and conjunctions have

little effect on improving the accuracy. On average, the normalization and probabilistic

methods respectively show an increase in accuracy of 3.00% and 3.76% among the techniques,

with the highest increase shown by the technique based on combined word order similarity and

word-sentence similarity. On the other hand, the removal of verb information either causes a

deterioration of accuracy or makes no difference for the various approaches. Because of the

former, we can argue that verbs play an important role in contributing meaning to sentences.

Our future research plans include applying the sentence similarity measures to

information retrieval activities involving Malay documents. In addition, the evaluation of

word-to-word similarity should be extended to other methods such as the term co-occurrence

corpus-based method and the semantic network method, which will require the construction of

a linguistic ontology similar to WordNet for the Malay language.

Acknowledgments. The authors wish to thank the Ministry of Higher Education for the funds

provided for this project and also the anonymous referees for their helpful and constructive

comments on this paper.

References

Abdullah, M. T., Ahmad, F., Mahmod, R. T., and Sembok, T. M. (2003). Evaluating the
effectiveness of thesaurus and stemming methods in retrieving Malay translated al-Quran
documents. In T. M. T. Sembok, H. B. Zaman, H. Chen, S. R. Urs, and S.-H. Myaeng
(eds.), Proceedings of the 6th International Conference on Asian Digital Libraries,
(ICADL) 2003, Kuala Lumpur, Malaysia, pp. 663–665.

Ahmad, F., Yusoff, M., and Sembok, T. M. (1996). Experiments with a stemming algorithm
for Malay words. Journal of the American Society for Information Science, 47(12), 909–918.

Aliguliyev, R.M. (2009). A new sentence similarity measure and sentence based extractive
technique for automatic text summarization. Expert Systems with Applications, 36, 7764–
7772.

Bollegala, D., Matsuo, Y., and Ishizuka, M. (2011). A Web Search Engine-Based Approach to
Measure Semantic Similarity between Words. IEEE Trans. Knowl. Data Eng., 23(7),
977-990.

Buitelaar, P., Cimiano, P., and Magnini, B. (2005). Ontology learning from text: An
overview. In P. Buitelaar, P. Cimiano, and B. Magnini (eds.), Ontology Learning from
Text: Methods, Evaluation and Applications, pp. 1–9. Amsterdam: IOS Press.

Castillo, J.J. and Cardenas, M. E. (2010). Using Sentence Semantic Similarity Based on
WordNet in Recognizing Textual Entailment. In A. Kuri-Morales and G. R. Simari
(eds.), Advances in Artificial Intelligence – 12th Ibero-American Conference on AI
(IBERAMIA), Bahía Blanca, Argentina, pp. 366-375.

Chang, M. S. (1980). The morphological analysis of Bahasa Malaysia. In Proceedings of the
8th Conference on Computational Linguistics, Penang, Malaysia, pp. 578–585.

Cilibrasi, R. and Vitanyi, P. M. B. (2006). Similarity of objects and the meaning of words. In
J-Y Chai, S. B. Cooper and A. Li (eds.), Proceedings of the 3rd Conference on Theory
and Applications of Models of Computation (TAMC), Beijing, China, pp. 21–45.

Egozi, O., Markovitch, S. and Gabrilovich, E. (2011). Concept-Based Information Retrieval
Using Explicit Semantic Analysis. ACM Transactions on Information Systems, 29(2), 8:1–8:34.
Kamus Dewan. (2005). Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia.

Karim, N. S. (1995). Malay Grammar for Academics and Professionals. Kuala Lumpur:
Dewan Bahasa dan Pustaka.

Karov, Y. and Edelman, S. (1998). Similarity-based word sense disambiguation.
Computational Linguistics, 24(1), 41–59.

Ko, Y., Park, J., and Seo, J. (2004). Improving text categorization using the importance of
sentences. Information Processing and Management, 40(1), 65–79.

Kong, T. E. and Yusoff, Z. (1995). Natural language analysis in machine translation (MT)
based on the string-tree correspondence grammar (STCG). Paper presented at the
10th Pacific Asia Conference on Language, Information and Computation (PACLIC10).

Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet sense
similarity for word sense identification. In C. Fellbaum (ed.), WordNet, an Electronic
Lexical Database, pp. 305–332. Boston: The MIT Press.

Lee, M. C. (2011). A novel sentence similarity measure for semantic-based expert systems.
Expert Systems with Applications, 38(5), 6392–6399.

Lemaire, B. and Denhière, G. (2006). Effects of high-order co-occurrences on word semantic


similarity. Current Psychology Letters 18(1). http://cpl.revues.org/document471.html.

Lesk, M. E. (1986). Automatic sense disambiguation using machine readable dictionaries:


How to tell a pine cone from an ice cream cone. In V. DeBuys (ed.), Proceedings of the
5th annual international conference on systems documentation, University of Toronto,
Canada, pp. 24–26.

Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., and Crockett, K. (2006). Sentence
similarity based on semantic nets and corpus statistics. IEEE Transactions on
Knowledge and Data Engineering, 18(8), 1138–1150.

Liu, H. and Wang, P. (2013). Assessing Sentence Similarity Using WordNet based Word
Similarity. Journal of Software, 8(6), 1451-1458.

Liu, S., Liu, F., Yu, C., and Meng, W. (2004). An effective approach to document retrieval
via utilizing WordNet and recognizing phrases. In K. Jarvelin, J. Allan, P. Bruza and M.
Sanderson (eds.), Proceedings of the 27th Annual International ACM SIGIR
Conference, Sheffield, UK, pp. 266–72.

Metzler, D., Bernstein, Y., Croft, W.B., Moffat, A., and Zobel, J. (2005). Similarity
measures for tracking information flow. In O. Herzog, H-J Schek, N. Fuhr, A.
Chowdhury, and W. Teiken (eds.), Proceedings of the CIKM’05, Bremen, Germany, pp.
517–524.

Mihalcea, R., Corley, C., and Strapparava, C. (2006). Corpus based and knowledge based
measures of text semantic similarity. In A. Cohn (ed.), Proceedings of the American
Association for Artificial Intelligence (AAAI 2006), Boston, Massachusetts, pp. 775–
780.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.

Moidin, A. H. (2008). Sinonim A-Z untuk Pelajar. Kuala Lumpur: IBS.

Noah, S. A., Amruddin, A. Y., and Omar, N. (2007). Semantic similarity measures for Malay
sentences. In D. H-L Goh, T. H. Cao, I. Sølvberg, and E. M. Rasmussen (eds.),
Proceedings of the ICADL 2007, Hanoi, Vietnam, pp. 117–26.

Noor, N. K. M., Noah, S. A., Aziz, M. J. A. and Hamzah, M. P. (2012). Malay Anaphor and
Antecedent Candidate Identification: A Proposed Solution. In J-S. Pan, S-M. Chen, N. T.
Nguyen (eds.) Proceedings of the Asia Conference on Intelligent Information and Database
Systems (ACIIDS) (3), Kaohsiung, Taiwan, pp. 141-151

O'Shea, K. (2012). An approach to conversational agent design using semantic sentence


similarity. Applied Intelligence, 37(4), 558-568.

Othman, A. (1993). Pengakar perkataan melayu untuk sistem capaian dokumen. MSc Thesis.
National University of Malaysia, Bangi, Malaysia.

Qiu, G., Bu, J., Chen, C., Huang, P., and Cai, K. (2007). Syntactic impact on sentence
similarity measure in archive-based QA system. In J. Pei, V. S. Tseng, L. Cao, H.
Motoda, G. Xu (eds.), Proceedings of 11th Asia Pacific Conference on Advances in
Knowledge Discovery and Data Mining, Gold Coast, Australia, pp. 769–76.

Resnik, P. (1995). Using information content to evaluate the semantic similarity. In C. S.


Mellish (ed.), Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Montreal, Canada, pp. 448–453.

Rocchio, J. J. Jr (1971). Relevance feedback in information retrieval. In G. Salton (ed), The


Smart Retrieval Systems - Experiments in Automatic Document Processing. New
Jersey: Prentice-Hall.

Salton, G., and Lesk, M. (1971). Computer evaluation of indexing and text processing.
Journal of the ACM, 15(1), 8-36.

Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1),
97–124.

Sparck-Jones, K. (1972). A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation, 28(1), 11–21.

Turney, P. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In L. D.
Raedt and P. A. Flach (eds.), Proceedings of the 12th European Conference on Machine
Learning, Freiburg, Germany, pp. 491-502.

Verberne, S. (2007). Paragraph retrieval for why-question answering. In W. Kraaij, A. P. de
Vries, C. L. A. Clarke, N. Fuhr, and N. Kando (eds.), Proceedings of the 30th Annual
International ACM SIGIR Conference, Amsterdam, The Netherlands, pp. 922–922.

Verberne, S., Boves, L., Oostdijk, N., and Coppen, P.-A. (2008). Evaluating paragraph
retrieval for why–QA. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, R. W.
White (eds.), Proceedings of the 30th European Conference on IR Research, ECIR 2008,
Glasgow, UK, pp. 669–73.

Wiemer-Hastings, P. (2000). Adding syntactic information to LSA. In L. R. Gleitman and A.


K. Joshi (eds.), Proceedings of the 22nd Annual Conference on Cognitive Science,
Pennsylvania, US, pp. 989–93.

Zeng, H-J., He, Q-C., Chen, Z., Ma, W-Y., and Ma, J. (2004). Learning to cluster web search
results. In M. Sanderson, K. Järvelin, J. Allan, and P. Bruza (eds.), Proceedings of the
27th Annual International ACM SIGIR Conference, Sheffield, UK, pp. 210-217.

Zhang, Z. Q., Gentile, A. N., and Ciravegna, F. (2012). Recent advances in methods of lexical
semantic relatedness – a survey. Natural Language Engineering, 19(4), 411–479.

Zhongcheng, Z. (2009). Measuring semantic similarity based on WordNet. In Sixth Web


Information Systems and Applications Conference, Xuzhou, Jiangsu, China, pp. 88-92.

Figure 1. Distribution of word-to-word similarity measures for synonyms, probabilistic method

Figure 2. Distribution of word-to-word similarity measures for synonyms, normalization method
Table 1. Basic sentence patterns in Malay

| Pattern | Structure (Subject + Predicate) | Example |
|---|---|---|
| Pattern (1) | Noun Phrase (NP) Subject + Noun Phrase (NP) Predicate | Encik Ahmad guru sekolah |
| Pattern (2) | Noun Phrase (NP) Subject + Verb Phrase (VP) Predicate | Bapa mereka sedang berbual |
| Pattern (3) | Noun Phrase (NP) Subject + Adjective Phrase (AP) Predicate | Anak itu sihat sungguh |
| Pattern (4) | Noun Phrase (NP) Subject + Prepositional Phrase (PP) Predicate | Pejabat saya di Kuala Lumpur |

Table 2. Initial testing results (using cross probability method for unstemmed word-to-word similarity)

Target sentence 1: Saya pergi ke sekolah. (I went to school.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Saya pergi ke sekolah. (I went to school.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Saya berjalan ke sekolah. (I walked to school.) | 2 | 0.88 | 1.00 | 0.94 | 0.81 | 0.90 |
| Saya pergi ke madrasah. (I went to the religious school.) | 3 | 0.66 | 1.00 | 0.83 | 0.73 | 0.87 |
| Saya pergi ke kedai. (I went to a shop.) | 4 | 0.44 | 0.40 | 0.42 | 0.66 | 0.53 |
| Dia pergi ke kedai. (He went to a shop.) | 5 | 0.37 | 0.41 | 0.39 | 0.53 | 0.47 |
| Saya makan nasi di kedai. (I ate rice at a restaurant.) | 6 | 0.16 | 0.39 | 0.27 | 0.29 | 0.32 |

Target sentence 2: Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Saya membaca buku sambil minum air teh. (I read a book while drinking tea.) | 2 | 0.76 | 0.75 | 0.76 | 0.87 | 0.81 |
| Saya membelek majalah sambil minum air teh. (I skimmed a magazine while drinking tea.) | 3 | 0.39 | 0.67 | 0.53 | 0.68 | 0.62 |
| Saya menonton televisyen sambil minum air teh. (I watched television while drinking tea.) | 4 | 0.41 | 0.67 | 0.54 | 0.67 | 0.61 |
| Ahmad menonton televisyen sambil minum air teh. (Ahmad watched television while drinking tea.) | 5 | 0.34 | 0.70 | 0.52 | 0.44 | 0.57 |
| Saya menonton televisyen sambil baring. (I lay down and watched television.) | 6 | 0.17 | 0.22 | 0.19 | 0.31 | 0.26 |

Target sentence 3: Komputer riba sangat ringan. (Laptops are very light.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Komputer riba sangat ringan. (Laptops are very light.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Komputer riba amat ringan. (Laptops are extremely light.) | 2 | 0.95 | 0.58 | 0.76 | 0.83 | 0.70 |
| Komputer riba sangat berat. (Laptops are very heavy.) | 3 | 0.91 | 1.00 | 0.95 | 0.82 | 0.91 |
| Kalkulator kecil sangat ringan. (Small calculators are very light.) | 4 | 0.47 | 0.74 | 0.61 | 0.52 | 0.63 |
| Mesin kira sangat ringan. (Calculating machines are very light.) | 5 | 0.71 | 0.80 | 0.75 | 0.70 | 0.75 |
| Meja komputer amat berat. (Computer tables are very heavy.) | 6 | 0.39 | 0.61 | 0.50 | 0.33 | 0.47 |

Target sentence 4: Agensi kerajaan Malaysia (Malaysia government agency)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Agensi kerajaan Malaysia (Malaysia government agency) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Agensi kerajaan Cina (China government agency) | 2 | 0.81 | 0.31 | 0.56 | 0.69 | 0.50 |
| Agensi negara Malaysia (Country of Malaysia agency) | 3 | 0.96 | 0.89 | 0.92 | 0.81 | 0.85 |
| Agen negara asing (Foreign country agency) | 4 | 0.36 | 0.52 | 0.44 | 0.25 | 0.38 |
| Agen kerajaan Malaysia (Malaysia government agent) | 5 | 0.49 | 0.73 | 0.61 | 0.67 | 0.70 |
| Agensi bangsa Malaysia (Malaysian tribe agency) | 6 | 0.87 | 0.59 | 0.73 | 0.72 | 0.66 |
Table 3. Selected initial testing results

| Pair | Sentences compared | Human similarity judgement | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|---|
| 1 | Saya main bola. (I play with the ball.) / Saya tendang bola. (I kick the ball.) | 0.76 | 0.44 | 0.61 | 0.52 | 0.61 | 0.60 |
| 2 | Budi bahasa budaya kita. (Politeness is our culture.) / Akhlak mulia budaya kita. (Noble behavior is our culture.) | 0.67 | 0.57 | 0.70 | 0.63 | 0.54 | 0.62 |
| 3 | Budi bahasa budaya kita. (Politeness is our culture.) / Budaya Melayu unik sekali. (Malay culture is very unique.) | 0.32 | 0.39 | 0.12 | 0.26 | 0.33 | 0.23 |
| 4 | Senaman amalan sihat. (Exercise is a healthy activity.) / Badan sihat kerana bersenam. (Good health from exercise.) | 0.70 | 0.26 | 0.24 | 0.25 | 0.27 | 0.26 |
| 5 | Senaman amalan sihat. (Exercise is a healthy activity.) / Badan sihat kalau bersenam. (Good health would result from exercise.) | 0.64 | 0.36 | 0.58 | 0.47 | 0.42 | 0.50 |
| 6 | Saya pergi ke sekolah. (I went to school.) / Saya pergi ke kedai. (I went to a shop.) | 0.16 | 0.44 | 0.40 | 0.42 | 0.66 | 0.53 |
| 7 | Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.) / Ahmad menonton televisyen sambil minum air teh. (Ahmad watched television while drinking tea.) | 0.19 | 0.34 | 0.70 | 0.52 | 0.44 | 0.57 |
Table 4. Results of the approaches: percentage of correctly identified sentence similarities compared to the human standard

| Word-to-word similarity method | Exp # | Experiment details | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|---|
| Normalized with ξ = 0.37 | 1 | Unchanged | 54.12 | 56.67 | 53.33 | 55.83 | 57.50 |
| | 2 | Stemmed | 59.17 | 57.50 | 60.00 | 58.33 | 57.50 |
| | 3 | Stemmed; without conjunctions | 59.12 | 58.33 | 60.83 | 59.12 | 58.33 |
| | 4 | Stemmed; without conjunctions and prepositions | 59.17 | 58.33 | 60.83 | 59.12 | 58.33 |
| | 5 | Stemmed; without conjunctions, prepositions, and verbs | 59.17 | 58.33 | 58.33 | 60.00 | 58.33 |
| Probabilistic with ξ = 0.18 | 6 | Unchanged | 61.33 | 61.67 | 61.67 | 66.67 | 65.83 |
| | 7 | Stemmed | 67.50 | 62.50 | 66.67 | 66.67 | 65.83 |
| | 8 | Stemmed; without conjunctions | 67.50 | 63.33 | 66.67 | 67.50 | 67.50 |
| | 9 | Stemmed; without conjunctions and prepositions | 67.50 | 65.00 | 67.50 | 68.33 | 67.50 |
| | 10 | Stemmed; without conjunctions, prepositions, and verbs | 65.83 | 63.33 | 65.00 | 68.33 | 65.00 |