
Evaluation of Lexical-Based Approaches to the Semantic Similarity of Malay Sentences

Article in Journal of Quantitative Linguistics · April 2015
DOI: 10.1080/09296174.2014.1001637



Title: Evaluation of Lexical-Based Approaches to the Semantic Similarity of
Malay Sentences
(the final version of this paper appeared in JOURNAL OF QUANTITATIVE
LINGUISTICS 22(2) · APRIL 2015)

Authors:
Shahrul Azman Noah, Nazlia Omar, and Amru Yusrin Amruddin
Knowledge Technology Research Group
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia

Corresponding author:
Shahrul Azman Noah
Faculty of Information Science & Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia
e-mail: shahrul@ukm.edu.my; samn.ukm@gmail.com
tel.: +6-03-89216343, +6-013-3306626
fax: +6-03-89256732
Evaluation of Lexical-Based Approaches to the Semantic Similarity of Malay Sentences

Abstract. We evaluate existing and modified approaches for measuring the

semantic similarity of sentences in the Malay language. These approaches are

mainly used for English sentences and no studies to date have evaluated and

compared their effectiveness when applied to Malay sentences. We used a

preprocessed Malay machine-readable dictionary to calculate word-to-word

semantic similarity with two methods: probability of intersection and

normalization. We then used the word-to-word semantic similarity measure to

identify semantic sentence similarity. We evaluated five measures of semantic

sentence similarity: vector-based semantic similarity, word order similarity, highest word-to-sentence similarity, and combinations of vector-based similarity with word order similarity and of highest word-to-sentence similarity with word order similarity. We also

evaluated the effects of including and excluding lexical components such as

prepositions, conjunctions, verbs, and morphological variants.

Keywords: semantic similarity; Malay language; natural language processing

1 Introduction

The concept of similarity is very important in many applications related to information

retrieval and natural language processing (Egozi, Markovitch, & Gabrilovich 2011; Zhang,

Gentile & Ciravegna. 2012). Similarity measures are functions that calculate numeric values

associating (pairs of) objects. For example, a vector-space document retrieval system measures

the distance between a query and documents in the system. Here, a shorter distance indicates

greater similarity. The use of external sources such as domain corpora and lexical databases to enrich similarity measures with semantic information is a common approach (Liu, Yu

& Meng, 2004). Most of the previous work on measuring similarity (Lemaire & Denhiere 2006;

Bollegala, Matsuo & Ishizuka, 2011), however, has been based purely on word and phrase

similarity, thus ignoring the semantics that such words have in the context of the surrounding

sentence or even document. Semantic sentence similarity, in contrast, is a measure whereby a

set of sentences or terms within term lists are assigned a metric based on the likeness of their

meaning content (Cilibrasi & Vitanyi, 2006). The emphasis on word-to-word similarity metrics

is undoubtedly due to the availability of resources that explicitly specify the relations among

words, such as WordNet (ZoongCheng, 2009; Liu & Wang 2013). Although sentence similarity

is assumed to be more complex than word similarity, it performs better than measures based on word similarity in applications such as paragraph retrieval (Verberne, 2007; Verberne,

Boves, Oostdijk & Coppen, 2008), conversational agents (O'Shea, 2012), expert systems (Lee,

2011), and text summarization (Aliguliyev, 2009).

A number of researchers have investigated semantic sentence similarity for English

(Metzler, Bernstein, Croft, Moffat & Zobel, 2005; Castillo and Cardenas, 2010; Li, McLean,

Bandar, O’Shea, & Crockett, 2006) and have reported encouraging results. However, there

have been no investigations to date that compared approaches to sentence similarity for the

Malay language. The Malay language is an Austronesian language spoken by the Malay people

and people of other races who reside in the Malay Peninsula, southern Thailand, the

Philippines, Singapore, central eastern Sumatra, the Riau islands, and parts of the coast of

Borneo. It is an official language of Malaysia, Brunei, Singapore, Indonesia, and East Timor. Therefore, there are many print and digital documents written in the Malay language

and there is a great need for systems and algorithms to process such documents. In this paper,

we describe an experiment in measuring the semantic similarity of Malay sentences. Our

experiment is based on a number of dimensions such as word stemming, word-order vectors,

word semantic vectors, and word-to-word similarity. Section 2 provides the background

knowledge for our experiment and an overview of related research. Section 3 describes our

method for measuring the semantic similarity of Malay sentences and the evaluative

experiment that we conducted. Section 4 reports the results of our experiment, and Section 5

presents our conclusion and directions for future work.

2 Background and Related Work

As previously mentioned, the common approaches to measuring semantic similarity are

mainly at the word or concept level, with many fewer at the sentential level. A similarity

function for sentence similarity will, given two sentences (or segments), generate a score that

indicates their relatedness. Most sentence similarity measures, however, are mainly concerned

with “calculating” the presence or absence of words in the compared sentences, and popular

methods include word overlap measures, term frequency–inverse document frequency (TF-

IDF) measures, relative frequency measures, and probabilistic models. Semantic sentence

similarity measures, in contrast, extend these conventional approaches by calculating a score

for a pair of sentences that indicates their similarity at the semantic level. Sentence similarity

has been reported to be useful in applications such as question-answering systems (Qiu, Bu, Chen, Huang & Cai, 2007), text categorization (Ko, Park & Seo, 2004), and paraphrase recognition (Mihalcea, Corley & Strapparava, 2006). Mihalcea et al. (2006), for example,

reported that their experiment shows that a semantic sentence similarity measure outperforms

simpler vector-based similarity on paraphrase recognition tasks.

In a semantic sentence similarity measure, the first task is to obtain word-to-word

semantic measures for the compared sentences, and then scoring functions can be used to

generate the similarity value between the sentences. A relatively large number of word-to-word

similarity measures have previously been proposed in the literature. According to Mihalcea et

al. (2006), these fall into two groups: corpus-based measures and knowledge-based measures.

Corpus-based measures of semantic word similarity seek to identify the similarity between

words using information derived from large corpora (Turney, 2001; Karov & Edelman, 1998).

Turney (2001) proposed the “pointwise mutual information measure,” which was based on

term co-occurrence counts over large corpora. Another popular approach is latent semantic

analysis (LSA), whereby term co-occurrences are captured by means of dimensionality

reduction using singular value decomposition (SVD). Knowledge-based measures, on the other

hand, identify semantic similarity between words by using information from a dictionary or a

thesaurus to calculate degrees of relatedness among words. For example, Leacock and

Chodorow’s (1998) method counts the number of nodes on the shortest path between two

concepts in WordNet. Resnik (1995) and Li, McLean, Bandar, O’Shea, and Crockett (2006)

also use WordNet to calculate semantic measures. Lesk’s (1986) method defines semantic

similarity between two words based on overlap measures between the corresponding dictionary

definitions.

Experiments on semantic sentence similarity for English have shown promising results.

Mihalcea et al. (2006) showed that incorporating semantic information into measures of sentence similarity significantly improved paraphrase recognition compared with vector-based cosine similarity. They experimented with both corpus-based and knowledge-based approaches. In the corpus-based approach, the degree of similarity between words was derived from large corpora, whereas in the knowledge-based approach it was derived from WordNet using several measures, such as those of Leacock and Chodorow and of Lesk. Li et al. (2006) proposed a semantic sentence similarity measure using WordNet and corpus statistics. Their similarity measure is based on semantic and word order information; a detailed explanation of this method is given in the next section of this paper. Their work focused on short sentences such as those featured in applications like conversational agents and dialogue systems. Results from their experiments showed that the proposed method provides similarity measures that are fairly consistent with human judgements.

As mentioned earlier, sentence similarity measures benefit many applications. One of the earliest applications of text similarity is probably the vector space model of information retrieval, in which the relevance of documents to a given user query is determined by ranking algorithms that measure the similarity between the query vector and the document vectors (Salton & Lesk, 1971). Since then, text similarity has gained research interest in various applications such as relevance feedback and text classification (Rocchio, 1971) and word sense disambiguation (Lesk, 1986; Schutze, 1998).

Recently, with advances in the information retrieval field and the availability of large textual corpora and knowledge sources, semantic sentence similarity has received attention in several applications. In question-answering systems, semantic similarity measures between sentences play an important role in finding similar questions in archives of users' requests. Qiu et al. (2007), for example, showed how syntactic information embedded in similarity measures could outperform several baseline retrieval models. In paraphrase recognition, Mihalcea et al. (2006) experimented with various semantic similarity measures and showed that they outperform other vector-based models.

There is presently no research that compares semantic sentence similarity measures for

Malay sentences. There is some general research on Malay document retrieval. Ahmad, Yusoff,

and Sembok (1996) and Othman (1993), for instance, proposed algorithms for the stemming

of Malay words, Abdullah, Ahmad, Mahmod and Sembok (2003) applied the latent semantic

index approach to Malay-English cross-language document retrieval, Kong and Yusoff (1995)

worked towards English-Malay machine translation, and recently Noor, Noah, Aziz and Hamzah (2012) investigated methods for anaphora detection in Malay text. But so far, no

research has focused directly on the semantic similarity of Malay sentences.

3 Semantic Similarity Measures for Malay Sentences

A semantic sentence similarity measure compares a pair of sentences S1 and S2 and

automatically generates a value that indicates their similarity. The comparison of S1 and S2 is

usually done by means of word-to-word similarity measures among the constituent words in S1

and S2. Therefore, assuming that S1 and S2 can be represented as finite vectors of words {w1,

w2, w3,…,wm} and {v1, v2, v3,…,vn}, respectively, a number of possible scoring functions

proposed in the literature can be applied. The simplest would be to consider all the possible

similarities among the constituent words, as indicated in the following equation:

sim(S1, S2) = Σi=1..m Σj=1..n sim(wi, vj)    (1)
However, this would be impractical, as it requires processing every word pair, with complexity O(n²). In this

experiment, we therefore considered several approaches to measuring sentence similarity

mainly inspired by the work of Mihalcea et al. (2006) and Li et al. (2006). However, before

describing the experiment, we first discuss the approaches to measuring the semantic similarity

between words.
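As a rough illustration (not the authors' code), the exhaustive scoring of equation (1) can be sketched in Python; `word_sim` is a placeholder for any word-to-word measure such as those discussed in Section 3.1:

```python
# Naive scoring from equation (1): sum the word-to-word similarity
# over every cross-sentence word pair, i.e. m*n comparisons for
# sentences of lengths m and n.
def pairwise_similarity(s1, s2, word_sim):
    return sum(word_sim(w, v) for w in s1 for v in s2)
```

With an exact-match `word_sim`, this reduces to counting shared word occurrences across the two sentences.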

3.1 Malay Word-to-Word Semantic Similarity

As previously mentioned, semantic similarity measures between words can be grouped

into corpus-based measures and knowledge-based measures. We chose to focus on knowledge-

based measures, as there currently exists no large corpus for Malay language sources.

Furthermore, as a linguistic database for the Malay language similar to WordNet is not yet

available, we chose to use an existing lexical dictionary. The lexical dictionary contains 69,344

rows of data with 48,177 Malay words, based on the 4th edition of the Kamus Dewan (“Dewan

Dictionary”; an established Malay dictionary published by Dewan Bahasa dan Pustaka).

However, the dictionary is not yet available in a machine readable dictionary (MRD) format—

i.e., the dictionary is available only in a human-readable format—so some preprocessing was

required. The dictionary was parsed by filtering and eliminating symbols, short-form words,

verbs, and other words not found in the dictionary.

Our investigation of several methods for knowledge-based measures determined that

the only suitable method for our purpose was Lesk’s (1986) method. This is due to the nature

of the generated MRD dictionary, which only contains meanings of words and not the

hierarchical structure of words that models human common-sense knowledge about general

language usage such as is found in WordNet (Miller, 1995).

Using Lesk’s method, the similarity sim(w1, w2) of words w1 and w2 can be calculated

using either the probability of intersection or normalization. The probability of intersection

uses the following equation:

simprob(w1, w2) = P(Mw1 | C) · P(Mw2 | C)    (2)

where M denotes the meaning of the subscripted word, C is the set of unique overlapping words found in the meanings of w1 and w2, and P(Mw1 | C) refers to the probability of the meaning of word w1 containing an instance of C. The normalization method, on the other hand, is based on the following equation:

simnorm(w1, w2) = [P(Mw1 | C) + P(Mw2 | C)] / 2    (3)

As can be seen from equation (3), the normalization method is very similar to the probabilistic

method, except that the probabilities for the meanings of word w1 and w2 are normalized.

The following illustrates the calculation of both word-to-word semantic similarity methods. Assume that we want to find the similarity between the words sekolah (school) and madrasah (religious school). Referring to the MRD, there are ten unique overlapping words between the meanings of sekolah and madrasah (i.e., |C| = 10), and the total numbers of unique words in the meanings of sekolah and madrasah are 15 and 11, respectively. Therefore P(Msekolah|C) = 0.667 and P(Mmadrasah|C) = 0.909. By using equations (2) and (3) respectively, we obtain simprob(sekolah, madrasah) = 0.606 and simnorm(sekolah, madrasah) = 0.788.
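A minimal Python sketch of this Lesk-style overlap similarity (equations (2) and (3)); the definition lists below are synthetic stand-ins sized to reproduce the sekolah/madrasah example, not entries from the actual MRD:

```python
def lesk_similarity(definitions, w1, w2, method="prob"):
    # Word-to-word similarity from dictionary-definition overlap.
    m1, m2 = set(definitions[w1]), set(definitions[w2])
    c = m1 & m2                    # unique overlapping words C
    p1 = len(c) / len(m1)          # P(M_w1 | C)
    p2 = len(c) / len(m2)          # P(M_w2 | C)
    if method == "prob":
        return p1 * p2             # equation (2)
    return (p1 + p2) / 2           # equation (3)

# Synthetic definitions: 15 and 11 unique words with an overlap of 10,
# mirroring the sekolah/madrasah example in the text.
overlap = ["c%d" % i for i in range(10)]
defs = {
    "sekolah": overlap + ["s1", "s2", "s3", "s4", "s5"],
    "madrasah": overlap + ["m1"],
}
```

Run on these synthetic definitions, the sketch reproduces simprob(sekolah, madrasah) ≈ 0.606 and simnorm(sekolah, madrasah) ≈ 0.788.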

3.2 Semantic Similarity Measures between Sentences

The derived word-to-word semantic similarity values discussed above are used to

define the semantic similarity values between two sentences. A comparison of the semantic

similarity between two sentences can be implemented in one of two ways: either by comparing each

word in sentence S1 with all the words in sentence S2 and generating the similarity values based

on these word-to-word similarities or by constructing a joint distinct word set for the two

sentences. Assuming that we are comparing sentences S1 and S2, a set of distinct joint words S

is formed from S1 and S2 as follows:

S = S1 ∪ S2, where S1 = {w1, w2, w3, …, wm} and S2 = {v1, v2, v3, …, vn}

For example, assuming that we have the sentences S1: Saya berjalan ke sekolah (I walked to school) and S2: Dia berkereta ke bandar (He drove to town), then we will have S = {saya, berjalan, ke, sekolah, dia, berkereta, bandar}. The joint word set S is used to derive the various

semantic measures.

In this experiment, we consider the following measures of semantic sentence similarity: word order similarity, highest word-to-sentence similarity, semantic vector similarity, and combinations of word order similarity with the semantic vector and word-to-sentence measures. These measures require a measure

of word-to-word similarity as previously described. In order to establish a suitable value ξ for

a threshold for deciding whether compared words or terms are semantically similar, an

experiment involving 200 pairs of synonyms was conducted. The synonyms were derived

based on Moidin (2008). A pair of terms ti and tj are considered similar if sim(ti, tj) > ξ, while

similarity values less than ξ are considered to indicate that the words are not semantically

similar. For the word-to-word similarity methods considered in this study, i.e., the probabilistic

and normalization methods, ξ = 0.18 and ξ = 0.37 were selected as the respective threshold

values. The threshold values were derived empirically: word pairs with similarity values below 0.18 for the probability method or below 0.37 for the normalization method were intuitively judged to be dissimilar.
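The threshold decision can be sketched as follows (a trivial illustration, not the authors' code, using the empirically derived values reported above):

```python
# Empirically derived thresholds xi from the 200-synonym experiment:
# 0.18 for the probability method, 0.37 for the normalization method.
THRESHOLDS = {"prob": 0.18, "norm": 0.37}

def semantically_similar(sim_value, method="prob"):
    # Terms are considered similar only if sim(ti, tj) > xi.
    return sim_value > THRESHOLDS[method]
```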

3.2.1 Similarity Based on Semantic Vector

The similarity method based on semantic vectors (Simv) uses the joint word set S as a

basis to derive semantic information about the compared sentences S1 and S2. The joint word

set S is viewed as providing the semantic information for the compared sentences. There is an

open question about whether to consider morphological variants. Li et al. (2006) do not

consider morphological variants. In this case, the Malay words for makan (eat), makanan
(food), and pemakanan (nutrition) are considered to be three unique words and can all appear

in the joint set S. However, Noah, Amruddin, and Omar (2007) argue that morphological

variations among words play a significant role in deriving sentence similarity values as shown

in their simple experiment. We consider both cases in this experiment and will discuss in the

Results section the effects these cases have on determining the similarity values.

To derive the semantic information content of S1 and S2, term-term matrices for the two

sentences are constructed as follows:

S  =  w1    w2    ...   wn

          q1  | x1,1  x1,2  ...  x1,n |
          q2  | x2,1  x2,2  ...  x2,n |
Si =      ..  |  ..    ..   ...   ..  |
          qm  | xm,1  xm,2  ...  xm,n |

where xi,j represents the similarity measure between the ith word qi in the compared sentence

and the jth word wj of the joint word set S. The value of xi,j = 1 if qi and wj are the same word,

whereas if qi ≠ wj, the similarity measure is computed using the previously described word-to-

word semantic similarity method.

The raw semantic vector š for Si (i = 1,2) can then be computed with š = {max(x1,1, …,

xm,1), …, max(x1,n, …, xm,n)}. For example, if S1 = {negara, Malaysia, aman, sentosa} and S2 = {negara, jepun, maju}, then we have S = {negara, Malaysia, aman, sentosa, jepun, maju}. Comparing the joint set S with S1 and S2, we obtain the following term-term matrices, respectively:

| 1    0      0    0    0    0 |
| 0    1      0    0    0    0 |
| 0    0      1    0    0    0 |
| 0    0      0    1    0    0 |

| 0    0      0    0    1    0 |
| 1    0.667  0    0    0    0 |
| 0    0      0    0    0    1 |

and therefore, the raw semantic vector š for S1 and S2 will be {1, 1, 1, 1, 0, 0} and {1, 0.667, 0,

0, 0, 0} respectively.

For the calculation of the semantic vector si, the following formula is then used:

si = ši · I(wi) · I(w̄i)    (4)

where wi is a word in the joint word set S and w̄i is its associated word in the sentence. The

value of I(w), which is the weight of word w, is calculated with reference to the MRD

dictionary, using the following formula:

I(w) = 1 − log(n + 1) / log(N + 1)    (5)

where n is the number of rows of meaning containing the word w and N is the total number of rows (of meaning) in the dictionary. The use of I(wi) and I(w̄i) allows the two constituent words to contribute to the similarity based on their individual information contents (Li et al., 2006).

By using equations (4) and (5), we obtain the following semantic vectors s1 and s2:

s1 = {0.204, 0.286, 0.464, 0.767, 0, 0}

s2 = {0.204, 0.161, 0, 0, 0.342, 0.408}

Finally, the semantic similarity between the two compared sentences is simply the cosine

coefficient between the two semantic vectors:

Simv(S1, S2) = cos(s1, s2) = (s1 · s2) / (‖s1‖ × ‖s2‖)    (6)

Therefore, the Simv between S1 and S2 is 0.154.
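The raw-vector construction and the cosine of equation (6) can be sketched as follows (illustrative Python, with `word_sim` a placeholder for the word-to-word measure of Section 3.1):

```python
import math

def raw_semantic_vector(sentence, joint, word_sim):
    # For each word of the joint set S, keep the highest similarity
    # against any word of the sentence (1.0 on an exact match).
    return [max(1.0 if q == w else word_sim(q, w) for q in sentence)
            for w in joint]

def cosine(a, b):
    # Cosine coefficient between two semantic vectors, equation (6).
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms
```

Applied to the weighted vectors s1 and s2 above, `cosine` reproduces Simv(S1, S2) ≈ 0.154.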

3.2.2 Word Order Similarity Measure between Sentences

The measure for word order similarity (Simo) is a straightforward process based on the

distinct joint word set S. Assuming that we have the following pair of sentences S1 and S2:

S1: Negara Malaysia aman sentosa

S2: Jepun negara maju

we will have the joint word set S = {negara, Malaysia, aman, sentosa, Jepun, maju}. Similarly

to semantic similarity, the word order vector is derived from the joint set S. A term-term matrix

is constructed and the word-to-word similarity measure is calculated using the method

discussed in section 3.2.1. The resulting matrix for the sentence S1 and the joint set S is similar

to the one presented in section 3.2.1, but for readability we present it again as follows:

S1 \ S     negara  Malaysia  aman  sentosa  Jepun  maju
negara        1        0       0       0       0     0
Malaysia      0        1       0       0       0     0
aman          0        0       1       0       0     0
sentosa       0        0       0       1       0     0

u1 = (1 2 3 4 0 0)

The word order vector u1 for S1 is constructed based on the joint existence or the highest word-

to-word similarity between the joint set S and S1. Therefore we have u1 = (1 2 3 4 0 2); the last

value of u1 is equal to 2 because the word maju in S is strongly similar to the word Malaysia,

which is in the second position in S1. Similarly, we have u2 = (2 2 3 3 1 3) derived from the

following matrix:

S2 \ S     negara  Malaysia  aman  sentosa  Jepun  maju
Jepun         0        0       0       0       1     0
negara        1      0.667     0       0       0     0
maju          0        0       0       0       0     1

u2 = (2 2 0 0 1 3)

Using the word order similarity defined as follows:

Simo(S1, S2) = 1 − ‖u1 − u2‖ / ‖u1 + u2‖    (7)

we have Simo(S1, S2) = 0.828. The word order similarity in (7) is determined by the normalized difference of the word order vectors.
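Equation (7) can be sketched as follows (an illustrative Python implementation, not the authors' code):

```python
import math

def word_order_similarity(u1, u2):
    # Equation (7): one minus the Euclidean norm of the vector
    # difference divided by the norm of the vector sum.
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u1, u2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(u1, u2)))
    return 1.0 - diff / total
```

With the word order vectors u1 = (1, 2, 3, 4, 0, 2) and u2 = (2, 2, 3, 3, 1, 3) from the example, this gives ≈ 0.828 as reported.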

3.2.3 Semantic Similarity Measures based on Highest Word-to-Sentence Similarity

The highest word-to-sentence similarity approach (Sims) requires comparing each word

in each sentence with all the words in the other sentence. The similarity between the sentences

S1 and S2 is based upon the maximum word-to-word similarity between each word w in S1 and

the words in S2 and vice versa. The similarity measure is therefore calculated using the

following equation:

Sims(S1, S2) = (1/2) × [ (Σw∈S1 maxSim(w, S2) · idf(w)) / (Σw∈S1 idf(w))
                       + (Σw∈S2 maxSim(w, S1) · idf(w)) / (Σw∈S2 idf(w)) ]    (8)

where maxSim(w, S2) is the highest word-to-word similarity between the word w in S1 and the words in S2, and maxSim(w, S1) likewise finds the most similar word in S1 for each word w in S2. idf(w) measures the specificity of word w using the classic inverse document frequency (idf) introduced by Sparck-Jones (1972), represented as follows, where N is the total number of items in the collection and dfw is the number of items in the collection that contain the word w:

idf(w) = N / dfw    (9)

Mihalcea et al. (2006) used this approach for evaluating text semantic similarity based upon various word-to-word semantic similarity measures. Therefore, for the sentences S1 and S2 in the previous example, Sims(S1, S2) = 0.292.
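Equation (8) can be sketched as follows; `word_sim` and `idf` are placeholders for the word-to-word measure and equation (9), since the collection statistics behind the 0.292 figure are not given in the text:

```python
def highest_word_to_sentence_similarity(s1, s2, word_sim, idf):
    # Equation (8): idf-weighted average of each word's best match
    # in the other sentence, symmetrised over both directions.
    def directed(a, b):
        num = sum(max(1.0 if w == v else word_sim(w, v) for v in b) * idf(w)
                  for w in a)
        den = sum(idf(w) for w in a)
        return num / den
    return 0.5 * (directed(s1, s2) + directed(s2, s1))
```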

3.2.4 Combined Semantic Sentence Similarity Measures

Based upon the previous semantic similarity measures, we can derive combined semantic

measures. In this experiment we consider two combinations:

Simv+o = δSimv + (1 – δ)Simo (10)

and

Sims+o = δSims + (1 – δ)Simo (11)

where δ is a damping factor that decides the contribution of the individual similarity measures

used. Li et al. (2006) suggested that δ should be greater than 0.5 due to the importance of lexical

elements presented in the semantic similarity of Simv and Sims (Wiemer-Hastings, 2000).
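Equations (10) and (11) share the same form, sketched below; the default δ here is only an example value consistent with the δ > 0.5 suggestion, not a value prescribed by the paper:

```python
def combined_similarity(sim_semantic, sim_order, delta=0.85):
    # Equations (10)/(11): delta weights the semantic component
    # (Simv or Sims); Li et al. (2006) suggest delta > 0.5.
    return delta * sim_semantic + (1.0 - delta) * sim_order
```

For instance, with Simv = 0.154 and Simo = 0.828 from the worked example and δ = 0.5 (the setting used in the results section), Simv+o ≈ 0.491.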

As a summary, the aforementioned measures for semantic sentence similarity can be

represented in Figure 1. As can be seen, a set of unique words (S) is first generated from the

input compared sentences. The set of unique words is then compared with the input sentences

to generate raw semantic vector and word order vector for the Simv and Simo approaches

respectively. The raw semantic vector is then transformed into semantic similarity whereas the

word order vector is transformed into order similarity as discussed in the previous sections.

The pair of sentences is also directly processed using equation (8) to generate the highest word-

to-sentence similarity (Sims). The semantic similarity is combined with the order similarity to

produce Simv+o, whereas the same order similarity is combined with the highest word-to-sentence similarity to produce Sims+o.

<Figure 1>

4 Malay Language Grammatical Structure

The Malay language has been the national language of Malaysia since 1955 and is

formally known as Bahasa Malaysia (Chang, 1980). The basic Malay sentence consists of two

major constituents: subject and predicate. Similar to English, the subject and predicate can be

derived from noun phrases, verb phrases, or adjective phrases. According to Karim (1995),

there are four basic patterns in Malay as summarized in Table 1:

<Table 1>

There are four types of sentences in Malay: declarative, interrogative, imperative, and

exclamatory (Karim, 1995). A declarative sentence usually declares something about the

subject, e.g., “Ahmad pegawai di firma itu” (Ahmad is an officer in that firm). An interrogative

sentence usually poses a question such as “Siapa dia?” (Who is he?). An imperative sentence

makes a request or gives an order like “Keluar dari sini” (Get out of here). An exclamatory

sentence expresses emotion such as joy, surprise, or anger, e.g., “Wah…Besarnya rumah

ini!”(Wow…This house is huge!).

There are four categories of words in Malay: nouns, verbs, adjectives and function

words. Function words in Malay are words in various positions of a sentence that provide a

specific syntactic function, such as conjoining, modifying, emphasizing, negating, and

indicating specificity (Karim, 1995). Examples of function words are dan, jikalau, setelah, and

hanya.

As mentioned earlier, a Malay online dictionary, Kamus Dewan (DBP 2005), was used

as the lexical resource in this study. This dictionary consists of 48,177 Malay words with

69,334 definitions. Necessary steps were taken to filter out the function words, as these words

are not significant in measuring semantic similarities. Other types of words that were filtered

out include abbreviations such as “dll” and “sbg” and symbols such as “~” and “@.” However,

the symbol “-” is considered important due to its role in identifying reduplicated words such

as “kanak-kanak” and “layang-layang.” Reduplication of nouns generally gives a semantic

category of heterogeneity or an indefinite plural, while reduplication of verbals results in

semantic features that signify repetition, continuity, habituality, intensity, extensiveness, and

resemblance (Chang, 1980). Affixation also exists in Malay, whereby a base form is extended

by one or more affixes. In Malay, the affixes can be classified as prefixes, suffixes, infixes, and

circumfixes. Morphemes in the form of root words, stems, and affixes are taken into

consideration, as these may provide additional semantic features. In short, semantic similarity

features can be influenced by including or excluding function words, symbols, abbreviations,

and morphemes.

5 Results and Discussion

Testing was conducted for 200 pairs of common Malay sentences. These sentence pairs were

first rated by humans with a value between 0.0 and 1.0, where 0.0 indicates that the sentences

in the pair are not related at all and 1.0 indicates that the sentences are exactly the same or

similar. We decided to select a threshold value of 0.5 to indicate whether the pair of sentences

are semantically similar. The human-rated similarities (sometimes called the “gold standard”)

were then compared with the values derived from the similarity measures described in Section

3. Before proceeding to the analysis of results, we first provide a small walk-through example

of how the results compare for the various similarity measures. Some examples of testing

results are illustrated in Table 2.

Table 2 separates the results into vector-based semantic similarities (Simv), order similarities

(Simo), highest word-to-sentence similarities (Sims), and the combination of similarities Simv

and Simo (Simv+o) and of similarities Sims and Simo (Sims+o), with δ = 0.5. The testing illustrated

in Table 2 compares the first sentence of the list with the remaining six sentences. To facilitate

the discussion, we refer to the first sentence of the list as the “target sentence” and the remaining

six sentences as the “compared sentences.” The “human ranking of similarity” is the rank given by humans to each compared sentence in terms of its similarity to the target sentence.

The results in Table 2 show a consistent outcome between the human similarity ranking and

the automated sentence similarity measures, with very minimal differences.

<Table 2>

Table 3 shows selected results from the initial testing. The intention is to provide the
initial and general outcome of each approach as compared to the human similarity judgements.

<Table 3>

As can be seen in Table 3, the sentences in pair 1 were correctly identified as similar

by all approaches except Simv. Simv relies solely on the semantic vectors derived from the joint word set, and it seems that this alone is not enough to produce useful similarity values. The

sentences in pairs 2 and 3 were correctly identified as respectively semantically similar and not

similar by all approaches. However, from the values for pairs 4 and 5, we can see the effect of

the connective terms “kerana” (because) and “kalau” (if). The sentences in pair 4, which should be identified as similar, were decisively identified as not semantically similar by all approaches. Changing the word “kerana” to “kalau” in the second sentence of pair 4 to produce pair 5, however, seems to have a significant effect for the Simo and Sims+o approaches. In the

case of pair 6, the Sims approach wrongly classified the pair as semantically similar. This might

be due to the number of similar words in the two sentences, which were nevertheless semantically different due to the presence of the distinct nouns “kedai” (shop) and “sekolah” (school). The
< 19 >
sentences in pair 7 were wrongly classified as semantically similar by the Simo approach, as

the approach focuses on the word order similarity. In turn, the Simv+o and Sims+o values were

influenced by the high Simo similarity values.

Previous work in this area did not consider morphological variants among words.

However, our further observations found that morphological variants do have an impact on

sentence similarity. To illustrate this, consider the following compared sentences and their

similarity measures Simv+o. The underlined words are morphological variants in Malay. In the

first case, “kahwin” (married) is the root word for “berkahwin” (got married), and in the second

case, “baca” (read) is the root word for “membaca” (reading).

S1 = Saya suka lelaki bujang itu. (I like that bachelor man.)

S2 = Saya suka lelaki belum berkahwin itu. (I like that unmarried man.)

S3 = Saya suka lelaki belum kahwin itu. (I like that unmarried man.)

Simv+o (S1, S2) = 0.58; Simv+o (S1, S3) = 0.90

S4 = Saya suka mengaji buku. (I like to recite the book.)

S5 = Saya suka membaca buku. (I like reading the book.)

S6 = Saya suka baca buku. (I like to read the book.)

Simv+o (S4, S5) = 0.66; Simv+o (S4, S6) = 0.85

As we can see, words that are stemmed to their root words give higher similarity measures.

Morphological variants should therefore be taken into account, although handling them

adds processing overhead for automated systems.
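The effect of stemming on such pairs can be illustrated with a deliberately naive prefix stripper. This is our own sketch, not the stemmer used in the experiments: a real Malay stemmer must handle many more affixes and spelling rules (e.g., restoring the root-initial consonant so that “mengaji” yields “kaji”, not “gaji”).

```python
# Illustrative only: strips the few prefixes seen in the examples above.
PREFIXES = ("ber", "mem", "men", "me")

def strip_prefix(word):
    """Return a crude root form by removing the first matching prefix,
    keeping at least three characters of the candidate root."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            return word[len(p):]
    return word

# After stripping, "berkahwin" matches "kahwin" and "membaca" matches
# "baca", so word-overlap measures score such pairs higher.
```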

Based on these “walk-through” observations, we designed our testing so that it

considers the aforementioned issues and elements, i.e., morphological variants of terms and the

effects of connective words (conjunctions), prepositions, and verbs. Table 4 shows the results

of the testing compared with the human judgements. It reports the percentage of accurate

identifications for each approach, i.e., its ability to correctly identify the similarity of all pairs

(usually referred to as the “recall value”). Concerning the word-to-word similarity methods, the

results clearly show that the probabilistic intersection provides better outcomes, as evidenced

by Experiments 6–10, in which the percentage of accurate identifications increased for all

approaches, by between 5% (the Simo approach) and 10.84% (the Sims approach).
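The accuracy percentage can be read as the share of sentence pairs on which an approach agrees with the human judges about whether the pair is similar. A sketch under the assumption of a fixed decision threshold (the 0.5 value is illustrative, not taken from our experiments):

```python
def accuracy_pct(machine_scores, human_scores, threshold=0.5):
    """Percentage of pairs on which the machine classification
    (score >= threshold means "similar") matches the human one."""
    agree = sum(
        (m >= threshold) == (h >= threshold)
        for m, h in zip(machine_scores, human_scores)
    )
    return 100.0 * agree / len(machine_scores)
```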

<Table 4>

We summarize the analysis of our results according to the following dimensions:

i. Stemmed or morphological variants: A comparison of the results of Experiments 1
and 2 and of Experiments 6 and 7 indicates that stemming improves accuracy for the
majority of approaches, particularly for Simv and Simv+o.
ii. Conjunctions or connective words: The removal of conjunctions has little impact on
the accuracy of the similarity approaches, as evidenced by a comparison of
Experiments 2 and 3 and of Experiments 7 and 8.
iii. Prepositions: The removal of prepositions seems not to have any impact when using
the normalized approach to word-to-word similarity (Experiments 3 and 4).
However, a slight improvement was obtained when using the probabilistic approach
(Experiments 8 and 9).
iv. Verbs: Verbs play important roles in expressing the meanings of sentences. The
results show that the accuracy decreases for the majority of the approaches under the
probabilistic method when we remove such information but is maintained across
most normalization approaches. This is evidenced by a comparison of Experiments
4 and 5 and of Experiments 9 and 10.
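The experimental conditions above correspond to different preprocessing pipelines applied before the similarity computation. A sketch of how the variants can be composed; the word lists are tiny illustrative samples, not the full lists used in the experiments:

```python
# Sample word classes (illustrative subsets only).
CONJUNCTIONS = {"dan", "kerana", "kalau", "sambil"}
PREPOSITIONS = {"ke", "di", "dari"}

def preprocess(tokens, stem=None, drop=()):
    """Drop any configured word classes, then apply an optional stemmer.
    drop is a tuple of sets, e.g. (CONJUNCTIONS, PREPOSITIONS)."""
    banned = set().union(*drop) if drop else set()
    kept = [t for t in tokens if t not in banned]
    return [stem(t) for t in kept] if stem else kept

# Without conjunctions and prepositions; an Experiment 4-style
# condition would additionally pass a stemmer via stem=.
tokens = "saya pergi ke sekolah".split()
result = preprocess(tokens, drop=(CONJUNCTIONS, PREPOSITIONS))
```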

7 Conclusions and Future Work

While research in this area has been dominated by studies of the English language, little

work has focused on the Malay language. In this paper, we presented the results of our

evaluation of five lexical-based approaches to semantic similarity measures for Malay

sentences. These approaches compare pairs of sentences by first finding the similarity measures

between words. The two proposed word-to-word semantic measures are based on probabilistic

intersections and normalization. Our experiment shows consistent and encouraging results that

indicate the promising potential of applying these approaches to the Malay language.

In our experiment, the Malay MRD lexical database proved to be useful for measuring

word-to-word similarity in the absence of a structured knowledge base similar to WordNet.

The normalization and probabilistic similarity measures achieved a maximum of 59% and 67%

accuracy, respectively, which suggests that the probabilistic method is superior. In addition,

our evaluation shows that identification of morphological variants improves the accuracy of

the semantic similarity measure for Malay sentences, while prepositions and conjunctions have

little effect on improving the accuracy. On average, the normalization and probabilistic

methods respectively show an increase in accuracy of 3.00% and 3.76% among the techniques,

with the highest increase shown by the technique based on combined word order similarity and

word-sentence similarity. On the other hand, the removal of verb information either causes a

deterioration of accuracy or makes no difference for the various approaches. Because of the

former, we can argue that verbs play an important role in contributing meaning to sentences.

Our future research plans include applying the sentence similarity measures to

information retrieval activities involving Malay documents. In addition, the evaluation of

word-to-word similarity should be extended to other methods such as the term co-occurrence

corpus-based method and the semantic network method, which will require the construction of

a linguistic ontology similar to WordNet for the Malay language.

Acknowledgments. The authors wish to thank the Ministry of Higher Education for the funds

provided for this project and also the anonymous referees for their helpful and constructive

comments on this paper.

References

Abdullah, M. T., Ahmad, F., Mahmod, R. T., and Sembok, T. M. (2003). Evaluating the
effectiveness of thesaurus and stemming methods in retrieving Malay translated al-Quran
documents. In T. M. T. Sembok, H. B. Zaman, H. Chen, S. R. Urs, and S.-H. Myaeng
(eds.), Proceedings of the 6th International Conference on Asian Digital Libraries,
(ICADL) 2003, Kuala Lumpur, Malaysia, pp. 663–665.

Ahmad, F., Yusoff, M., and Sembok, T. M. (1996). Experiments with a stemming algorithm
for Malay words. Journal of the American Society for Information Science, 47(12), 909–918.

Aliguliyev, R.M. (2009). A new sentence similarity measure and sentence based extractive
technique for automatic text summarization. Expert Systems with Applications, 36, 7764–
7772.

Bollegala, D., Matsuo, Y., and Ishizuka, M. (2011). A Web Search Engine-Based Approach to
Measure Semantic Similarity between Words. IEEE Trans. Knowl. Data Eng., 23(7),
977-990.

Buitelaar, P., Cimiano, P., and Magnini, B. (2005). Ontology learning from text: An
overview. In P. Buitelaar, P. Cimiano, and B. Magnini (eds.), Ontology Learning from
Text: Methods, Evaluation and Applications, pp. 1–9. Amsterdam: IOS Press.

Castillo, J.J. and Cardenas, M. E. (2010). Using Sentence Semantic Similarity Based on
WordNet in Recognizing Textual Entailment. In A. Kuri-Morales and G. R. Simari
(eds.), Advances in Artificial Intelligence – 12th Ibero-American Conference on AI
(IBERAMIA), Bahía Blanca, Argentina, pp. 366-375.

Chang, M. S. (1980). The morphological analysis of Bahasa Malaysia. In Proceedings of the
8th Conference on Computational Linguistics, Penang, Malaysia, pp. 578–585.

Cilibrasi, R. and Vitanyi, P. M. B. (2006). Similarity of objects and the meaning of words. In
J-Y Chai, S. B. Cooper and A. Li (eds.), Proceedings of the 3rd Conference on Theory
and Applications of Models of Computation (TAMC), Beijing, China, pp. 21–45.

Egozi, O., Markovitch, S. and Gabrilovich, E. (2011). Concept-Based Information Retrieval
Using Explicit Semantic Analysis. ACM Transactions on Information Systems, 29(2), 8:1–8:34.
Kamus Dewan. (2005). Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia.

Karim, N. S. (1995). Malay Grammar for Academics and Professionals. Kuala Lumpur:
Dewan Bahasa dan Pustaka.

Karov, Y. and Edelman, S. (1998). Similarity-based word sense disambiguation.
Computational Linguistics, 24(1), 41–59.

Ko, Y., Park, J., and Seo, J. (2004). Improving text categorization using the importance of
sentences. Information Processing and Management, 40(1), 65–79.

Kong, T. E. and Yusoff, Z. (1995). Natural language analysis in machine translation (MT)
based on the string-tree correspondence grammar (STCG). Paper presented at the
10th Pacific Asia Conference on Language, Information and Computation (PACLIC10).

Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet sense
similarity for word sense identification. In C. Fellbaum (ed.), WordNet, an Electronic
Lexical Database, pp. 305–332. Boston: The MIT Press.

Lee, M. C. (2011). A novel sentence similarity measure for semantic-based expert systems.
Expert Systems with Applications, 38(5), 6392–6399.

Lemaire, B. and Denhière, G. (2006). Effects of high-order co-occurrences on word semantic


similarity. Current Psychology Letters 18(1). http://cpl.revues.org/document471.html.

Lesk, M. E. (1986). Automatic sense disambiguation using machine readable dictionaries:


How to tell a pine cone from an ice cream cone. In V. DeBuys (ed.), Proceedings of the
5th annual international conference on systems documentation, University of Toronto,
Canada, pp. 24–26.

Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., and Crockett, K. (2006). Sentence
similarity based on semantic nets and corpus statistics. IEEE Transactions on
Knowledge and Data Engineering, 18(8), 1138–1150.

Liu, H. and Wang, P. (2013). Assessing Sentence Similarity Using WordNet based Word
Similarity. Journal of Software, 8(6), 1451-1458.

Liu, S., Liu, F., Yu, C., and Meng, W. (2004). An effective approach to document retrieval
via utilizing WordNet and recognizing phrases. In K. Jarvelin, J. Allan, P. Bruza and M.
Sanderson (eds.), Proceedings of the 27th Annual International ACM SIGIR
Conference, Sheffield, UK, pp. 266–72.

Metzler, D., Bernstein, Y., Croft, W.B., Moffat, A., and Zobel, J. (2005). Similarity
measures for tracking information flow. In O. Herzog, H-J Schek, N. Fuhr, A.
Chowdhury, and W. Teiken (eds.), Proceedings of the CIKM’05, Bremen, Germany, pp.
517–524.

Mihalcea, R., Corley, C., and Strapparava, C. (2006). Corpus based and knowledge based
measures of text semantic similarity. In A. Cohn (ed.), Proceedings of the American
Association for Artificial Intelligence (AAAI 2006), Boston, Massachusetts, pp. 775–
780.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.

Moidin, A. H. (2008). Sinonim A-Z untuk Pelajar. Kuala Lumpur: IBS.

Noah, S. A., Amruddin, A. Y., and Omar, N. (2007). Semantic similarity measures for Malay
sentences. In D. H-L Goh, T. H. Cao, I. Sølvberg, and E. M. Rasmussen (eds.),
Proceedings of the ICADL 2007, Hanoi, Vietnam, pp. 117–26.

Noor, N. K. M., Noah, S. A., Aziz, M. J. A. and Hamzah, M. P. (2012). Malay Anaphor and
Antecedent Candidate Identification: A Proposed Solution. In J-S. Pan, S-M. Chen, N. T.
Nguyen (eds.) Proceedings of the Asia Conference on Intelligent Information and Database
Systems (ACIIDS) (3), Kaohsiung, Taiwan, pp. 141-151

O'Shea, K. (2012). An approach to conversational agent design using semantic sentence


similarity. Applied Intelligence, 37(4), 558-568.

Othman, A. (1993). Pengakar perkataan melayu untuk sistem capaian dokumen. MSc Thesis.
National University of Malaysia, Bangi, Malaysia.

Qiu, G., Bu, J., Chen, C., Huang, P., and Cai, K. (2007). Syntactic impact on sentence
similarity measure in archive-based QA system. In J. Pei, V. S. Tseng, L. Cao, H.
Motoda, G. Xu (eds.), Proceedings of 11th Asia Pacific Conference on Advances in
Knowledge Discovery and Data Mining, Gold Coast, Australia, pp. 769–76.

Resnik, P. (1995). Using information content to evaluate the semantic similarity. In C. S.


Mellish (ed.), Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Montreal, Canada, pp. 448–453.

Rocchio, J. J. Jr (1971). Relevance feedback in information retrieval. In G. Salton (ed), The


Smart Retrieval Systems - Experiments in Automatic Document Processing. New
Jersey: Prentice-Hall.

Salton, G., and Lesk, M. (1971). Computer evaluation of indexing and text processing.
Journal of the ACM, 15(1), 8-36.

Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1),
97–124.

Sparck-Jones, K. (1972). A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation, 28(1), 11–21.

Turney, P. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In L. D.
Raedt and P. A. Flach (eds.), Proceedings of the 12th European Conference on Machine
Learning, Freiburg, Germany, pp. 491-502.

Verberne, S. (2007). Paragraph retrieval for why-question answering. In W. Kraaij, A. P. de
Vries, C. L. A. Clarke, N. Fuhr, and N. Kando (eds.), Proceedings of the 30th Annual
International ACM SIGIR Conference, Amsterdam, The Netherlands, pp. 922–922.

Verberne, S., Boves, L., Oostdijk, N., and Coppen, P.-A. (2008). Evaluating paragraph
retrieval for why–QA. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, R. W.
White (eds.), Proceedings of the 30th European Conference on IR Research, ECIR 2008,
Glasgow, UK, pp. 669–73.

Wiemer-Hastings, P. (2000). Adding syntactic information to LSA. In L. R. Gleitman and A.


K. Joshi (eds.), Proceedings of the 22nd Annual Conference on Cognitive Science,
Pennsylvania, US, pp. 989–93.

Zeng, H-J., He, Q-C., Chen, Z., Ma, W-Y., and Ma, J. (2004). Learning to cluster web search
results. In M. Sanderson, K. Järvelin, J. Allan, and P. Bruza (eds.), Proceedings of the
27th Annual International ACM SIGIR Conference, Sheffield, UK, pp. 210-217.

Zhang, Z. Q., Gentile, A. N., and Ciravegna, F. (2012). Recent advances in methods of lexical
semantic relatedness – a survey. Natural Language Engineering, 19(4), 411–479.

Zhongcheng, Z. (2009). Measuring semantic similarity based on WordNet. In Sixth Web


Information Systems and Applications Conference, Xuzhou, Jiangsu, China, pp. 88-92.

Figure 1. Distribution of word-to-word similarity measures for synonyms, probabilistic method

Figure 2. Distribution of word-to-word similarity measures for synonyms, normalization method
Table 1. Basic sentence patterns in Malay

| Pattern | Structure (Subject + Predicate) | Example |
|---|---|---|
| Pattern (1) | Noun Phrase (NP) Subject + Noun Phrase (NP) Predicate | Encik Ahmad guru sekolah |
| Pattern (2) | Noun Phrase (NP) Subject + Verb Phrase (VP) Predicate | Bapa mereka sedang berbual |
| Pattern (3) | Noun Phrase (NP) Subject + Adjective Phrase (AP) Predicate | Anak itu sihat sungguh |
| Pattern (4) | Noun Phrase (NP) Subject + Prepositional Phrase (PP) Predicate | Pejabat saya di Kuala Lumpur |

Table 2. Initial testing results (using cross probability method for unstemmed word-to-word similarity)

Target sentence 1: Saya pergi ke sekolah. (I went to school.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Saya pergi ke sekolah. (I went to school.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Saya berjalan ke sekolah. (I walked to school.) | 2 | 0.88 | 1.00 | 0.94 | 0.81 | 0.90 |
| Saya pergi ke madrasah. (I went to the religious school.) | 3 | 0.66 | 1.00 | 0.83 | 0.73 | 0.87 |
| Saya pergi ke kedai. (I went to a shop.) | 4 | 0.44 | 0.40 | 0.42 | 0.66 | 0.53 |
| Dia pergi ke kedai. (He went to a shop.) | 5 | 0.37 | 0.41 | 0.39 | 0.53 | 0.47 |
| Saya makan nasi di kedai. (I ate rice at a restaurant.) | 6 | 0.16 | 0.39 | 0.27 | 0.29 | 0.32 |

Target sentence 2: Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Saya membaca buku sambil minum air teh. (I read a book while drinking tea.) | 2 | 0.76 | 0.75 | 0.76 | 0.87 | 0.81 |
| Saya membelek majalah sambil minum air teh. (I skimmed a magazine while drinking tea.) | 3 | 0.39 | 0.67 | 0.53 | 0.68 | 0.62 |
| Saya menonton televisyen sambil minum air teh. (I watched television while drinking tea.) | 4 | 0.41 | 0.67 | 0.54 | 0.67 | 0.61 |
| Ahmad menonton televisyen sambil minum air teh. (Ahmad watched television while drinking tea.) | 5 | 0.34 | 0.70 | 0.52 | 0.44 | 0.57 |
| Saya menonton televisyen sambil baring. (I lay down and watched television.) | 6 | 0.17 | 0.22 | 0.19 | 0.31 | 0.26 |

Target sentence 3: Komputer riba sangat ringan. (Laptops are very light.)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Komputer riba sangat ringan. (Laptops are very light.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Komputer riba amat ringan. (Laptops are extremely light.) | 2 | 0.95 | 0.58 | 0.76 | 0.83 | 0.70 |
| Komputer riba sangat berat. (Laptops are very heavy.) | 3 | 0.91 | 1.00 | 0.95 | 0.82 | 0.91 |
| Kalkulator kecil sangat ringan. (Small calculators are very light.) | 4 | 0.47 | 0.74 | 0.61 | 0.52 | 0.63 |
| Mesin kira sangat ringan. (Calculating machines are very light.) | 5 | 0.71 | 0.80 | 0.75 | 0.70 | 0.75 |
| Meja komputer amat berat. (Computer tables are very heavy.) | 6 | 0.39 | 0.61 | 0.50 | 0.33 | 0.47 |

Target sentence 4: Agensi kerajaan Malaysia (Malaysia government agency)

| Sentence compared | Human similarity ranking | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|
| Agensi kerajaan Malaysia (Malaysia government agency) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Agensi kerajaan Cina (China government agency) | 2 | 0.81 | 0.31 | 0.56 | 0.69 | 0.50 |
| Agensi negara Malaysia (Country of Malaysia agency) | 3 | 0.96 | 0.89 | 0.92 | 0.81 | 0.85 |
| Agen negara asing (Foreign country agency) | 4 | 0.36 | 0.52 | 0.44 | 0.25 | 0.38 |
| Agen kerajaan Malaysia (Malaysia government agent) | 5 | 0.49 | 0.73 | 0.61 | 0.67 | 0.70 |
| Agensi bangsa Malaysia (Malaysian tribe agency) | 6 | 0.87 | 0.59 | 0.73 | 0.72 | 0.66 |
Table 3. Selected initial testing results

| Pair | Sentences compared | Human similarity judgement | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|---|
| 1 | Saya main bola. (I play with the ball.) / Saya tendang bola. (I kick the ball.) | 0.76 | 0.44 | 0.61 | 0.52 | 0.61 | 0.60 |
| 2 | Budi bahasa budaya kita. (Politeness is our culture.) / Akhlak mulia budaya kita. (Noble behavior is our culture.) | 0.67 | 0.57 | 0.70 | 0.63 | 0.54 | 0.62 |
| 3 | Budi bahasa budaya kita. (Politeness is our culture.) / Budaya Melayu unik sekali. (Malay culture is very unique.) | 0.32 | 0.39 | 0.12 | 0.26 | 0.33 | 0.23 |
| 4 | Senaman amalan sihat. (Exercise is a healthy activity.) / Badan sihat kerana bersenam. (Good health from exercise.) | 0.70 | 0.26 | 0.24 | 0.25 | 0.27 | 0.26 |
| 5 | Senaman amalan sihat. (Exercise is a healthy activity.) / Badan sihat kalau bersenam. (Good health would result from exercise.) | 0.64 | 0.36 | 0.58 | 0.47 | 0.42 | 0.50 |
| 6 | Saya pergi ke sekolah. (I went to school.) / Saya pergi ke kedai. (I went to a shop.) | 0.16 | 0.44 | 0.40 | 0.42 | 0.66 | 0.53 |
| 7 | Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.) / Ahmad menonton televisyen sambil minum air teh. (Ahmad watched television while drinking tea.) | 0.19 | 0.34 | 0.70 | 0.52 | 0.44 | 0.57 |
Table 4. Results of the approaches: percentage of correctly identified sentence similarities compared to the human standard

| Word-to-word similarity method | Exp # | Experiment details | Simv | Simo | Simv+o | Sims | Sims+o |
|---|---|---|---|---|---|---|---|
| Normalized with ξ = 0.37 | 1 | Unchanged | 54.12 | 56.67 | 53.33 | 55.83 | 57.50 |
| | 2 | Stemmed | 59.17 | 57.50 | 60.00 | 58.33 | 57.50 |
| | 3 | Stemmed; without conjunctions | 59.12 | 58.33 | 60.83 | 59.12 | 58.33 |
| | 4 | Stemmed; without conjunctions and prepositions | 59.17 | 58.33 | 60.83 | 59.12 | 58.33 |
| | 5 | Stemmed; without conjunctions, prepositions, and verbs | 59.17 | 58.33 | 58.33 | 60.00 | 58.33 |
| Probabilistic with ξ = 0.18 | 6 | Unchanged | 61.33 | 61.67 | 61.67 | 66.67 | 65.83 |
| | 7 | Stemmed | 67.50 | 62.50 | 66.67 | 66.67 | 65.83 |
| | 8 | Stemmed; without conjunctions | 67.50 | 63.33 | 66.67 | 67.50 | 67.50 |
| | 9 | Stemmed; without conjunctions and prepositions | 67.50 | 65.00 | 67.50 | 68.33 | 67.50 |
| | 10 | Stemmed; without conjunctions, prepositions, and verbs | 65.83 | 63.33 | 65.00 | 68.33 | 65.00 |