Authors:
Shahrul Azman Noah, Nazlia Omar, and Amru Yusrin Amruddin
Knowledge Technology Research Group
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia
Corresponding author:
Shahrul Azman Noah
Faculty of Information Science & Technology
Universiti Kebangsaan Malaysia
43600 UKM Bangi Selangor
Malaysia
e-mail: shahrul@ukm.edu.my; samn.ukm@gmail.com
tel.: +6-03-89216343, +6-013-3306626
fax: +6-03-89256732
Evaluation of Lexical-Based Approaches to the Semantic Similarity of Malay Sentences
mainly used for English sentences, and no studies to date have evaluated and compared such approaches for the Malay language.
1 Introduction
retrieval and natural language processing (Egozi, Markovitch, & Gabrilovich 2011; Zhang,
Gentile & Ciravegna 2012). Similarity measures are functions that calculate numeric values
associating (pairs of) objects. For example, a vector-space document retrieval system measures
the distance between a query and documents in the system. Here, a shorter distance indicates
greater similarity. The use of external sources, such as domain corpora and lexical databases, to improve similarity measures with semantic meaning is a common approach (Liu, Liu, Yu & Meng, 2004). Most of the previous work on measuring similarity (Lemaire & Denhiere 2006;
Bollegala, Matsuo & Ishizuka, 2011), however, has been based purely on word and phrase
similarity, thus ignoring the semantics that such words have in the context of the surrounding text. In semantic similarity, words within a set of sentences or terms within term lists are assigned a metric based on the likeness of their
meaning content (Cilibrasi & Vitanyi, 2006). The emphasis on word-to-word similarity metrics
is undoubtedly due to the availability of resources that explicitly specify the relations among
words, such as WordNet (ZoongCheng, 2009; Liu & Wang 2013). Sentence similarity based on word similarity has been applied in applications such as paragraph retrieval (Verberne 2007; Verberne, Boves, Oostdijk & Coppen, 2008), conversational agents (O'Shea, 2012), expert systems (Lee, 2011), and other tasks (Metzler, Bernstein, Croft, Moffat & Zobel, 2005; Castillo and Cardenas, 2010; Li, McLean, Bandar, O'Shea, & Crockett, 2006), and such studies have reported encouraging results. However, there
have been no investigations to date that compared approaches to sentence similarity for the
Malay language. The Malay language is an Austronesian language spoken by the Malay people
and people of other races who reside in the Malay Peninsula, southern Thailand, the
Philippines, Singapore, central eastern Sumatra, the Riau islands, and parts of the coast of
Borneo. It has been an official language of Malaysia, Brunei, Singapore, Indonesia, and East
Timor. Therefore, there are many paper and digital documents written in the Malay language
and there is a great need for systems and algorithms to process such documents. In this paper, we evaluate approaches to measuring the semantic similarity of Malay sentences based on word order, word semantic vectors, and word-to-word similarity. Section 2 provides the background
knowledge for our experiment and an overview of related research. Section 3 describes our
method for measuring the semantic similarity of Malay sentences and the evaluative
experiment that we conducted. Section 4 reports the results of our experiment, and Section 5 concludes the paper.

Research on similarity measures to date has been conducted mainly at the word or concept level, with far fewer studies at the sentential level. A similarity
function for sentence similarity will, given two sentences (or segments), generate a score that
indicates their relatedness. Most sentence similarity measures, however, are mainly concerned
with “calculating” the presence or absence of words in the compared sentences, and popular
methods include word overlap measures, term frequency–inverse document frequency (TF-IDF) measures, relative frequency measures, and probabilistic models. Semantic sentence similarity measures, in contrast, generate a value for a pair of sentences that indicates their similarity at the semantic level. Sentence similarity
has been reported to be useful in applications such as question-answering systems (Qiu, Bu,
Chen, Huang & Cai, 2007), text categorization (Ko, Park & Seo 2004), and paraphrase
recognition (Mihalcea, Corley & Strapparava, 2006). Mihalcea et al. (2006), for example,
reported that their experiments show that a semantic sentence similarity measure outperforms the traditional vector-based cosine similarity. Typically, word-to-word semantic measures are first derived for the compared sentences, and then scoring functions can be used to
generate the similarity value between the sentences. A relatively large number of word-to-word
similarity measures have previously been proposed in the literature. According to Mihalcea et
al. (2006), these fall into two groups: corpus-based measures and knowledge-based measures.
Corpus-based measures of semantic word similarity seek to identify the similarity between
words using information derived from large corpora (Turney, 2001; Karov & Edelman, 1998).
Turney (2001) proposed the “pointwise mutual information measure,” which was based on
term co-occurrence counts over large corpora. Another popular approach is latent semantic analysis (LSA), which performs dimensionality reduction using singular value decomposition (SVD). Knowledge-based measures, on the other
hand, identify semantic similarity between words by using information from a dictionary or a
thesaurus to calculate degrees of relatedness among words. For example, Leacock and
Chodorow’s (1998) method counts the number of nodes on the shortest path between two
concepts in WordNet. Resnik (1995) and Li, McLean, Bandar, O’Shea, and Crockett (2006)
also use WordNet to calculate semantic measures. Lesk’s (1986) method defines semantic
similarity between two words based on overlap measures between the corresponding dictionary
definitions.
Experiments on semantic sentence similarity for English have shown promising results.
Mihalcea et al. (2006) proved that incorporating semantic information into measures of
sentence similarity significantly increased the recognition likelihood as compared to the vector-
based cosine similarity. They experimented with the corpus-based and knowledge-based
approaches. In the corpus-based approach, the degree of similarity between words was derived from large corpora, whereas in the knowledge-based approach such similarity was derived from WordNet using several measures, such as those of Leacock and Chodorow and of Lesk. The work by Li et al. (2006) proposed a semantic sentence similarity measure using WordNet and corpus
statistics. The similarity measure is based on semantic and word order information. A detailed explanation of this method is given in the next section of this paper. Their work focused
on short sentences which are featured in applications such as conversational agents and
dialogue systems. Results from their experiments showed that the proposed method provides encouraging results.

One of the earliest applications of text similarity is probably the vector space model of information
retrieval. In this case, the relevancy of documents to a given user query is determined by
ranking algorithms that measure the similarity of the query vector and the document vectors
(Salton & Lesk 1971). Since then, text similarity has gained research interest in various
applications such as relevance feedback and text classification (Rocchio 1971), and word sense disambiguation.

Recently, with the advancement of the information retrieval field and the availability of large textual corpora and knowledge sources, semantic sentence similarity has received attention in various applications. In question-answering systems, similarity measures between sentences play an important role in finding similar questions in the archive of users' requests. Qiu et al. (2007), for example, showed how syntactic information embedded in similarity measures could outperform some baseline retrieval models. In recognizing textual entailment, Castillo and Cardenas (2010) applied sentence semantic similarity measures and showed that they outperform other vector-based models.
There is presently no research that compares semantic sentence similarity measures for
Malay sentences. There is some general research on Malay document retrieval. Ahmad, Yusoff,
and Sembok (1996) and Othman (1993), for instance, proposed algorithms for the stemming
of Malay words, Abdullah, Ahmad, Mahmod and Sembok (2003) applied the latent semantic
index approach to Malay-English cross-language document retrieval, Kong and Yusoff (1995)
made an effort towards English-Malay machine translation and recently Noor, Noah, Aziz and
Hamzah (2012) investigate methods for anaphora detection in Malay text. But so far, no
automatically generates a value that indicates their similarity. The comparison of S1 and S2 is
usually done by means of word-to-word similarity measures among the constituent words in S1
and S2. Therefore, assuming that S1 and S2 can be represented as finite vectors of words {w1,
w2, w3,…,wm} and {v1, v2, v3,…,vn}, respectively, a number of possible scoring functions
proposed in the literature can be applied. The simplest would be to consider all the possible word-to-word pairings between the two sentences. Our experiment is mainly inspired by the work of Mihalcea et al. (2006) and Li et al. (2006). However, before
describing the experiment, we first discuss the approaches to measuring the semantic similarity
between words.
In this study, we considered knowledge-based rather than corpus-based measures, as there currently exists no large corpus for Malay language sources.
Furthermore, as a linguistic database for the Malay language similar to WordNet is not yet
available, we chose to use an existing lexical dictionary. The lexical dictionary contains 69,344
rows of data with 48,177 Malay words, based on the 4th edition of the Kamus Dewan (DBP 2005). However, the dictionary is not yet available in a machine-readable dictionary (MRD) format—
i.e., the dictionary is available only in a human-readable format—so some preprocessing was
required. The dictionary was parsed by filtering and eliminating symbols, short-form words, and abbreviations. Among the knowledge-based word-to-word similarity methods, the only suitable method for our purpose was Lesk's (1986) method. This is due to the nature
of the generated MRD dictionary, which only contains meanings of words and not the
hierarchical structure of words that models human common-sense knowledge about general concepts, as found in resources such as WordNet.

Using Lesk's method, the similarity sim(w1, w2) of words w1 and w2 can be calculated
where M denotes the meaning of the subscripted word, C is the set of unique overlap words found in the meanings of w1 and w2, and P(Mw1 | C) refers to the probability of the meaning of word w1 containing an instance of C. The normalization method, on the other hand, is based on
As can be seen from equation (3), the normalization method is very similar to the probabilistic
method, except that the probabilities for the meanings of word w1 and w2 are normalized.
The following illustrates the calculation of both word-to-word semantic similarity methods. Assume that we want to find the similarity between the words sekolah (school) and madrasah (religious school). Referring to the MRD, there are eight unique overlap words between sekolah and madrasah (i.e., |C| = 8), and the total number of unique words in the meaning of sekolah and
By using equations (2) and (3) respectively, we will obtain simprob(sekolah, madrasah) = 0.606,
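To make the overlap computation concrete, the following Python sketch implements a Lesk-style overlap similarity over MRD definitions. Because the exact weightings of equations (2) and (3) are not fully reproduced above, the scoring rule here (the product of each definition's overlap proportion) and the helper names are illustrative assumptions, not the study's exact implementation.

```python
def overlap_similarity(meaning1, meaning2):
    """Lesk-style similarity between two dictionary definitions.

    The score combines, for each word, the proportion of its definition
    covered by the overlap set C (an assumed stand-in for the paper's
    probabilistic measure, equation (2))."""
    m1, m2 = set(meaning1.split()), set(meaning2.split())
    if not m1 or not m2:
        return 0.0
    overlap = m1 & m2                # C: the unique overlap words
    p1 = len(overlap) / len(m1)      # share of the meaning of w1 found in C
    p2 = len(overlap) / len(m2)      # share of the meaning of w2 found in C
    return p1 * p2


def is_similar(score, xi=0.18):
    """Words are judged semantically similar when the score exceeds the
    empirically derived threshold (0.18 for the probabilistic method)."""
    return score > xi
```

For the sekolah/madrasah example the score also depends on the definition lengths, which are not reproduced in the text, so no exact value is shown here.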
The derived word-to-word semantic similarity values discussed above are used to
define the semantic similarity values between two sentences. A comparison of the semantic
similarity between two sentences can be implemented in two ways, either by comparing each
word in sentence S1 with all the words in sentence S2 and generating the similarity values based
on these word-to-word similarities or by constructing a joint distinct word set for the two
sentences. Assuming that we are comparing sentences S1 and S2, a set of distinct joint words S
S = S1 ∪ S2, where S1 = {w1, w2, w3, …, wm} and S2 = {v1, v2, v3, …, vn}
For example, assuming that we have the sentences S1: Saya berjalan ke sekolah (I walked to
school) and S2: Dia berkereta ke bandar (He drove to town), then we will have S = {saya, berjalan, ke, sekolah, dia, berkereta, bandar}. The joint word set S is used to derive the various
semantic measures.
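The joint word set construction can be sketched in Python as follows (an illustrative transcription of S = S1 ∪ S2, keeping first-seen order so that word positions remain meaningful for the word order measure; not code from the original study):

```python
def joint_word_set(s1, s2):
    """Distinct joint word set S = S1 ∪ S2, preserving first-seen order."""
    seen = []
    for w in s1 + s2:
        if w not in seen:
            seen.append(w)
    return seen

S1 = ["saya", "berjalan", "ke", "sekolah"]
S2 = ["dia", "berkereta", "ke", "bandar"]
S = joint_word_set(S1, S2)
# S == ["saya", "berjalan", "ke", "sekolah", "dia", "berkereta", "bandar"]
```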
In this experiment, we consider four measures for semantic sentence similarity: word
order similarity, highest word-to-sentence similarity, semantic sentence similarity, and a hybrid
of word order similarity and semantic sentence similarity. These measures require a word-to-word similarity measure as described earlier. To determine a threshold for deciding whether compared words or terms are semantically similar, an
experiment involving 200 pairs of synonyms was conducted. The synonyms were derived
based on Moidin (2008). A pair of terms ti and tj are considered similar if sim(ti, tj) > ξ, while
similarity values less than ξ are considered to indicate that the words are not semantically
similar. For the word-to-word similarity methods considered in this study, i.e., the probabilistic
and normalization methods, ξ = 0.18 and ξ = 0.37 were selected as the respective threshold
values. The threshold values were empirically derived: when using the probability and normalization methods, word similarity values of less than 0.18 and 0.37, respectively, indicate that the words are not semantically similar.
The similarity method based on semantic vectors (Simv) uses the joint word set S as a
basis to derive semantic information about the compared sentences S1 and S2. The joint word
set S is viewed as providing the semantic information for the compared sentences. There is an
open question about whether to consider morphological variants. Li et al. (2006) do not
consider morphological variants. In this case, the Malay words for makan (eat), makanan
(food), and pemakanan (nutrition) are considered to be three unique words and can all appear
in the joint set S. However, Noah, Amruddin, and Omar (2007) argue that morphological
variations among words play a significant role in deriving sentence similarity values as shown
in their simple experiment. We consider both cases in this experiment and will discuss in the
Results section the effects these cases have on determining the similarity values.
To derive the semantic information content of S1 and S2, term-term matrices for the two compared sentences are first constructed, where xi,j represents the similarity measure between the ith word qi in the compared sentence and the jth word wj of the joint word set S. The value of xi,j = 1 if qi and wj are the same word, whereas if qi ≠ wj, the similarity measure is computed using the previously described word-to-word methods.

The raw semantic vector š for Si (i = 1, 2) can then be computed with š = {max(x1,1, …, xm,1), …, max(x1,n, …, xm,n)}. For example, if S1 = {negara, Malaysia, aman, sentosa} and S2 =
{negara, jepun, maju}, then we have S = {negara, Malaysia, aman, sentosa, jepun, maju}.
Comparing the joint set S with S1 and S2, we obtain the following term-term matrices, respectively:

S1 \ S      negara  Malaysia  aman  sentosa  jepun  maju
negara        1       0        0      0       0      0
Malaysia      0       1        0      0       0      0
aman          0       0        1      0       0      0
sentosa       0       0        0      1       0      0

S2 \ S      negara  Malaysia  aman  sentosa  jepun  maju
jepun         0       0        0      0       1      0
negara        1       0.667    0      0       0      0
maju          0       0        0      0       0      1

and therefore the raw semantic vectors š for S1 and S2 will be {1, 1, 1, 1, 0, 0} and {1, 0.667, 0, 0, 1, 1}, respectively.
For the calculation of the semantic vector si, the following formula is then used:

si = ši × I(wi) × I(w̃i)    (4)

where wi is the ith word of the joint set S, w̃i is its most similar word in the compared sentence, and the value of I(w), which is the weight of word w, is calculated with reference to the MRD as

I(w) = 1 − log(n + 1) / log(N + 1)    (5)

where n is the number of rows of meaning containing the word w and N is the total number of rows of meaning in the MRD. This weighting allows words to contribute to the similarity based on their individual information contents (Li et al. 2006).
By using equations (4) and (5), we obtain the semantic vectors s1 and s2.
Finally, the semantic similarity between the two compared sentences is simply the cosine of the semantic vectors:

Simv(S1, S2) = cos(s1, s2) = (s1 · s2) / (‖s1‖ × ‖s2‖)    (6)
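The raw semantic vector, the I(w) weighting of equation (5), and the cosine of equation (6) can be sketched as follows. This is an illustrative sketch: the `word_sim` argument stands for either word-to-word method from section 3.2.1, and its concrete form is left to the caller.

```python
import math

def raw_semantic_vector(sentence, joint_set, word_sim):
    """Element i is the highest similarity between joint-set word i and any
    word of the sentence (1.0 for an exact match)."""
    return [max(1.0 if w == q else word_sim(w, q) for q in sentence)
            for w in joint_set]

def information_content(n, N):
    """I(w) = 1 - log(n + 1) / log(N + 1), equation (5); n is the number of
    rows of meaning containing w, N the total rows in the MRD."""
    return 1.0 - math.log(n + 1) / math.log(N + 1)

def cosine(v1, v2):
    """Sim_v, equation (6): cosine of the two semantic vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

A word that appears in no row of meaning receives the maximum weight I(w) = 1, so rare words contribute most to the similarity.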
The measure for word order similarity (Simo) is a straightforward process based on the
distinct joint word set S. Assuming that we have the same pair of sentences as before, S1 = {negara, Malaysia, aman, sentosa} and S2 = {negara, jepun, maju}, we will have the joint word set S = {negara, Malaysia, aman, sentosa, jepun, maju}. Similarly
to semantic similarity, the word order vector is derived from the joint set S. A term-term matrix
is constructed and the word-to-word similarity measure is calculated using the method
discussed in section 3.2.1. The resulting matrix for the sentence S1 and the joint set S is similar
to the one presented in section 3.2.1, but for readability we present it again as follows:
S1 \ S      negara  Malaysia  aman  sentosa  jepun  maju
negara        1       0        0      0       0      0
Malaysia      0       1        0      0       0      0
aman          0       0        1      0       0      0
sentosa       0       0        0      1       0      0

u1 = (1 2 3 4 0 0)
The word order vector u1 for S1 is constructed based on the joint existence or the highest word-
to-word similarity between the joint set S and S1. Therefore we have u1 = (1 2 3 4 0 2); the last
value of u1 is equal to 2 because the word maju in S is strongly similar to the word Malaysia, which is in the second position in S1. Similarly, we have u2 = (2 2 3 3 1 3), derived from the
following matrix:
S2 \ S      negara  Malaysia  aman  sentosa  jepun  maju
jepun         0       0        0      0       1      0
negara        1       0.667    0      0       0      0
maju          0       0        0      0       0      1

u2 = (2 2 0 0 1 3)
Simo(S1, S2) = 1 − ‖u1 − u2‖ / ‖u1 + u2‖    (7)
Using the word order vectors u1 and u2 above, we have Simo(S1, S2) = 0.828. The word order similarity in (7) is thus determined by the normalized difference of word order between the two sentences.
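The word order computation can be sketched in Python as follows (an illustrative sketch; the threshold ξ and the `word_sim` function follow section 3.2.1, and the vectors below reproduce the u1 and u2 of the worked example):

```python
import math

def word_order_vector(sentence, joint_set, word_sim, xi=0.18):
    """u[i] is the 1-based position in the sentence of joint-set word i, or
    of the sentence word most similar to it (above threshold xi); else 0."""
    u = []
    for w in joint_set:
        if w in sentence:
            u.append(sentence.index(w) + 1)
            continue
        best, pos = 0.0, 0
        for j, q in enumerate(sentence):
            s = word_sim(w, q)
            if s > best:
                best, pos = s, j + 1
        u.append(pos if best > xi else 0)
    return u

def sim_order(u1, u2):
    """Sim_o = 1 - ||u1 - u2|| / ||u1 + u2||, equation (7)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u1, u2)))
    plus = math.sqrt(sum((a + b) ** 2 for a, b in zip(u1, u2)))
    return (1.0 - diff / plus) if plus else 1.0
```

For u1 = (1, 2, 3, 4, 0, 2) and u2 = (2, 2, 3, 3, 1, 3), this yields approximately 0.828, matching the value above.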
3.2.3 Semantic Similarity Measures based on Highest Word-to-Sentence Similarity
The highest word-to-sentence similarity approach (Sims) requires comparing each word
in each sentence with all the words in the other sentence. The similarity between the sentences
S1 and S2 is based upon the maximum word-to-word similarity between each word w in S1 and
the words in S2 and vice versa. The similarity measure is therefore calculated using the
following equation, which follows Mihalcea et al. (2006):

Sims(S1, S2) = ½ ( Σw∈S1 maxSim(w, S2) · idf(w) / Σw∈S1 idf(w) + Σw∈S2 maxSim(w, S1) · idf(w) / Σw∈S2 idf(w) )    (8)

where maxSim(w, S2) is the highest similarity between the word w in S1 and any word in segment S2, and maxSim(w, S1), likewise, determines the most similar word in S1 for each word in S2.
idf(w) measures the specificity of word w using the classic inverse document frequency (idf)
introduced by Sparck-Jones (1972), represented as follows, where N is the total number of items in the collection and dfw is the number of items in the collection that contain the word w:
idf(w) = N / dfw    (9)
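The highest word-to-sentence measure can be sketched as follows. This is an illustrative reading of the idf-weighted, two-direction average described above, not the study's exact implementation; the `word_sim` and `idf` arguments stand for the word-to-word methods and equation (9).

```python
def max_sim(w, sentence, word_sim):
    """Highest similarity between word w and any word of the sentence."""
    return max(1.0 if w == q else word_sim(w, q) for q in sentence)

def sim_sentence(s1, s2, word_sim, idf):
    """Sim_s: average of idf-weighted best matches in both directions."""
    def directed(a, b):
        num = sum(max_sim(w, b, word_sim) * idf(w) for w in a)
        den = sum(idf(w) for w in a)
        return num / den if den else 0.0
    return 0.5 * (directed(s1, s2) + directed(s2, s1))
```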
Mihalcea et al. (2006) used this approach for evaluating text semantic similarity based upon
various word-to-word semantic similarity measures. Therefore, for the sentences S1 and S2 in
3.2.4 Combined Semantic Sentence Similarity Measures
Based upon the previous semantic similarity measures, we can derive combined semantic sentence similarity measures:

Simv+o(S1, S2) = δ · Simv(S1, S2) + (1 − δ) · Simo(S1, S2)

and

Sims+o(S1, S2) = δ · Sims(S1, S2) + (1 − δ) · Simo(S1, S2)

where δ is a damping factor that decides the contribution of the individual similarity measures
used. Li et al. (2006) suggested that δ should be greater than 0.5 due to the importance of lexical
elements presented in the semantic similarity of Simv and Sims (Wiemer-Hastings, 2000).
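The combination can be sketched as a simple convex mix; δ = 0.5 is the value used in the testing reported later, while Li et al. (2006) recommend δ > 0.5 (illustrative sketch only):

```python
def combined_similarity(sim_semantic, sim_order, delta=0.5):
    """Sim = delta * semantic + (1 - delta) * order. The combined measures
    Sim_v+o and Sim_s+o differ only in which semantic score is passed in."""
    return delta * sim_semantic + (1.0 - delta) * sim_order
```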
The overall process of deriving the combined measures is represented in Figure 1. As can be seen, a set of unique words (S) is first generated from the
input compared sentences. The set of unique words is then compared with the input sentences
to generate raw semantic vector and word order vector for the Simv and Simo approaches
respectively. The raw semantic vector is then transformed into semantic similarity whereas the
word order vector is transformed into order similarity as discussed in the previous sections.
The pair of sentences is also directly processed using equation (8) to generate the highest word-
to-sentence similarity (Sims). The semantic similarity is combined with the order similarity to produce Simv+o, whereas the same order similarity is combined with the highest word-to-sentence similarity to produce Sims+o.

<Figure 1>
4 Malay Language Grammatical Structure
The Malay language has been the national language of Malaysia since 1955 and is
formally known as Bahasa Malaysia (Chang, 1980). The basic Malay sentence consists of two
major constituents: subject and predicate. Similar to English, the subject and predicate can be
derived from noun phrases, verb phrases, or adjective phrases. According to Karim (1995),
<Table 1>
There are four types of sentences in Malay: declarative, interrogative, imperative, and
exclamatory (Karim, 1995). A declarative sentence usually declares something about the
subject, e.g., “Ahmad pegawai di firma itu” (Ahmad is an officer in that firm). An interrogative
sentence usually poses a question such as “Siapa dia?” (Who is he?). An imperative sentence
makes a request or gives an order like “Keluar dari sini” (Get out of here). An exclamatory
sentence expresses emotion such as joy, surprise, or anger, e.g., “Wah…Besarnya rumah
There are four categories of words in Malay: nouns, verbs, adjectives and function
words. Function words in Malay are words in various positions of a sentence that provide grammatical functions, such as indicating specificity (Karim, 1995). Examples of function words are dan, jikalau, setelah, and
hanya.
As mentioned earlier, a Malay online dictionary, Kamus Dewan (DBP 2005), was used
as the lexical resource in this study. This dictionary consists of 48,177 Malay words with
69,334 definitions. Necessary steps were taken to filter out the function words, as these words
are not significant in measuring semantic similarities. Other types of words that were filtered
out include abbreviations such as “dll” and “sbg” and symbols such as “~” and “@.” However,
the symbol "-" is considered important due to its role in identifying reduplicated words, which carry semantic features that signify repetition, continuity, habituality, intensity, extensiveness, and
resemblance (Chang, 1980). Affixation also exists in Malay, whereby a base form is extended
by one or more affixes. In Malay, the affixes can be classified as prefixes, suffixes, infixes, and
circumfixes. Morphemes in the form of root words, stems, and affixes are taken into
consideration, as these may provide additional semantic features. In short, semantic similarity is measured at the level of both words and morphemes.

Testing was conducted for 200 pairs of common Malay sentences. These sentence pairs were
first rated by humans with a value between 0.0 and 1.0, where 0.0 indicates that the sentences
in the pair are not related at all and 1.0 indicates that the sentences are identical or highly similar. We selected a threshold value of 0.5 to decide whether the pair of sentences
are semantically similar. The human-rated similarities (sometimes called the “gold standard”)
were then compared with the values derived from the similarity measures described in Section
3. Before proceeding to the analysis of results, we first provide a small walk-through example
of how the results compare for the various similarity measures. Some examples of testing results are shown in Table 2.
Table 2 separates the results into vector-based semantic similarities (Simv), order similarities
(Simo), highest word-to-sentence similarities (Sims), and the combination of similarities Simv
and Simo (Simv+o) and of similarities Sims and Simo (Sims+o), with δ = 0.5. The testing illustrated
in Table 2 compares the first sentence of the list with the remaining six sentences. To facilitate
the discussion, we refer to the first sentence of the list as the “target sentence” and the remaining
six sentences as the "compared sentences." The "human ranking of similarity" is the ranking given by humans to the compared sentences in terms of their similarity to the target sentence.
The results in Table 2 show a consistent outcome between the human similarity ranking and the rankings produced by the similarity measures.
<Table 2>
Table 3 shows selected results from the initial testing. The intention is to provide the
initial and general outcome of each approach as compared to the human similarity judgements.
<Table 3>
As can be seen in Table 3, the sentences in pair 1 were correctly identified as similar by all approaches except Simo. Simo is specifically concerned with differences between the word order vectors, and it seems that this alone is not enough to produce useful similarity values. The
sentences in pairs 2 and 3 were correctly identified as respectively semantically similar and not
similar by all approaches. However, from the values for pairs 4 and 5, we can see the effect of
the connective terms “kerana” (because) and “kalau” (if). The sentences in pair 4 which should
be identified as similar, however, are decisively identified as not semantically similar by all
approaches. However, changing the word “kerana” to “kalau” in the second sentence of pair 4
to produce pair 5 seems to have a significant effect for the Simo and Sims+o approaches. In the
case of pair 6, the Sims approach wrongly classified the pair as semantically similar. This might be due to the large number of shared words in the two sentences, even though the sentences are semantically different owing to the distinct nouns "kedai" (shop) and "sekolah" (school). The
sentences in pair 7 were wrongly classified as semantically similar by the Simo approach, as
the approach focuses on the word order similarity. In turn, the Simv+o and Sims+o values were
Previous work in this area did not consider morphological variants among words.
However, our further observations found that morphological variants do have an impact on
sentence similarity. To illustrate this, consider the following compared sentences and their
similarity measures Simv+o. The underlined words are morphological variants in Malay. In the
first case, “kahwin” (married) is the root word for “berkahwin” (got married), and in the second
S2 = Saya suka lelaki belum berkahwin itu. (I like that unmarried man.)
S3 = Saya suka lelaki belum kahwin itu. (I like that unmarried man.)
As we can see, words that are stemmed to their root words give higher similarity measures.
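The effect can be illustrated with a deliberately naive affix stripper. The toy affix lists below are hypothetical stand-ins for illustration only, not the stemming algorithm of Ahmad, Yusoff, and Sembok (1996):

```python
# Hypothetical, toy affix lists for illustration only.
PREFIXES = ("ber", "pe", "me")
SUFFIXES = ("kan", "an", "i")

def naive_stem(word):
    """Strip at most one known prefix and one known suffix, keeping a stem
    of at least three characters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[: -len(s)]
            break
    return word
```

With this, berkahwin and kahwin normalize to the same token, so sentences S2 and S3 above become identical after stemming.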
Based on these “walk-through” observations, we designed our testing so that it
considers the aforementioned issues and elements, i.e., morphological variants of terms and the
effects of connective words (conjunctions), prepositions, and verbs. Table 4 shows the result
of the testing compared with the human judgements. It shows the percentage of accurate
identifications for each approach, or the ability to correctly identify the similarity for all pairs
(usually referred to as "recall values"). Concerning the word-to-word similarity methods, the results clearly show that the probability of intersection provides better outcomes, as evidenced by experiments 6–10, where the percentage of accurate identifications for all approaches increased by between 5% (the Simo approach) and 10.84% (the Sims approach).
<Table 4>
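The percentage of accurate identifications reported above can be computed as the agreement rate with the human judgements at the 0.5 threshold; a minimal sketch (illustrative, not the study's evaluation script):

```python
def percent_accurate(machine_scores, human_scores, threshold=0.5):
    """Share of sentence pairs where the measure and the human judgement
    agree on similar vs. not similar at the given threshold."""
    agree = sum((m > threshold) == (h > threshold)
                for m, h in zip(machine_scores, human_scores))
    return 100.0 * agree / len(machine_scores)
```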
7 Conclusions and Future Work
While research in this area has been dominated by studies of the English language, little
work has been focused on the Malay language. In this paper, we presented the results of our evaluation of lexical-based approaches to the semantic similarity of Malay sentences. These approaches compare pairs of sentences by first finding the similarity measures
between words. The two proposed word-to-word semantic measures are based on probabilistic
intersections and normalization. Our experiment shows consistent and encouraging results that
indicate the promising potential of applying these approaches to the Malay language.
In our experiment, the Malay MRD lexical database proved to be useful for measuring word-to-word semantic similarity. The normalization and probabilistic similarity measures achieved a maximum of 59% and 67%
accuracy, respectively, which suggests that the probabilistic method is superior. In addition,
our evaluation shows that identification of morphological variants improves the accuracy of
the semantic similarity measure for Malay sentences, while pronouns and conjunctions have
little effect on improving the accuracy. On average, the normalization and probabilistic
methods respectively show an increase in accuracy of 3.00% and 3.76% among the techniques,
with the highest increase shown by the technique based on combined word order similarity and
word-sentence similarity. On the other hand, the removal of verb information either causes a
deterioration of accuracy or makes no difference for the various approaches. Because of the
former, we can argue that verbs play an important role in contributing meaning to sentences.
Our future research plans include applying the sentence similarity measures to other applications. In addition, the word-to-word similarity should be extended to other methods, such as the term co-occurrence
corpus-based method and the semantic network method, which will require the construction of such resources for the Malay language.
Acknowledgments. The authors wish to thank the Ministry of Higher Education for the funds
provided for this project and also the anonymous referees for their helpful and constructive comments.
References
Abdullah, M. T., Ahmad, F., Mahmod, R. T., and Sembok, T. M. (2003). Evaluating the
effectiveness of thesaurus and stemming methods in retrieving Malay translated al-Quran
documents. In T. M. T. Sembok, H. B. Zaman, H. Chen, S. R. Urs, and S.-H. Myaeng
(eds.), Proceedings of the 6th International Conference on Asian Digital Libraries,
(ICADL) 2003, Kuala Lumpur, Malaysia, pp. 663–665.
Ahmad, F., Yusoff, M., and Sembok, T. M. (1996). Experiments with a stemming algorithm
for Malay words. Journal of the American Society for Information Science, 47(12), 909–
18.
Aliguliyev, R.M. (2009). A new sentence similarity measure and sentence based extractive
technique for automatic text summarization. Expert Systems with Applications, 36, 7764–
7772.
Bollegala, D., Matsuo, Y., and Ishizuka, M. (2011). A Web Search Engine-Based Approach to
Measure Semantic Similarity between Words. IEEE Trans. Knowl. Data Eng., 23(7),
977-990.
Buitelaar, P., Cimiano, P., and Magnini, B. (2005). Ontology learning from text: An
overview. In P. Buitelaar, P. Cimiano, and B. Magnini (eds.), Ontology Learning from
Text: Methods, Evaluation and Applications, pp. 1–9. Amsterdam: IOS Press.
Castillo, J.J. and Cardenas, M. E. (2010). Using Sentence Semantic Similarity Based on
WordNet in Recognizing Textual Entailment. In A. Kuri-Morales and G. R. Simari
(eds.), Advances in Artificial Intelligence – 12th Ibero-American Conference on AI
(IBERAMIA), Bahía Blanca, Argentina, pp. 366-375.
Cilibrasi, R. and Vitanyi, P. M. B. (2006). Similarity of objects and the meaning of words. In
J-Y Chai, S. B. Cooper and A. Li (eds.), Proceedings of the 3rd Conference on Theory
and Applications of Models of Computation (TAMC), Beijing, China, pp. 21–45.
Karim, N. S. (1995). Malay Grammar for Academics and Professionals. Kuala Lumpur:
Dewan Bahasa dan Pustaka.
Ko, Y., Park, J., and Seo, J. (2004). Improving text categorization using the importance of
sentences. Information Processing and Management, 40(1), 65–79.
Kong, T. E. and Yusoff, Z. (1995). Natural language analysis in machine translation (MT)
based on the string-tree correspondence grammar (STCG). In Paper presented at the
10th Pacific Asia Conference on Language, Information and Computation (PACLIC10).
Leacock, C. and Chodorow, M. (1998). Combining local context and WordNet sense
similarity for word sense identification. In C. Fellbaum (ed.), WordNet, an Electronic
Lexical Database, pp. 305–332. Boston: The MIT Press.
Lee, M. C. (2011). A novel sentence similarity measure for semantic-based expert systems.
Expert Systems with Applications, 38(5), 6392–6399
Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., and Crockett, K. (2006). Sentence
similarity based on semantic nets and corpus statistics. IEEE Transactions on
Knowledge and Data Engineering, 18(8), 1138–1150.
Liu, H. and Wang, P. (2013). Assessing Sentence Similarity Using WordNet based Word
Similarity. Journal of Software, 8(6), 1451-1458.
Liu, S., Liu, F., Yu, C., and Meng, W. (2004). An effective approach to document retrieval
via utilizing WordNet and recognizing phrases. In K. Jarvelin, J. Allan, P. Bruza and M.
Sanderson (eds.), Proceedings of the 27th Annual International ACM SIGIR
Conference, Sheffield, UK, pp. 266–72.
Metzler, D., Bernstein, Y., Croft, W.B., Moffat, A., and Zobel, J. (2005). Similarity
measures for tracking information flow. In O. Herzog, H-J Schek, N. Fuhr, A.
Chowdhury, and W. Teiken (eds.), Proceedings of the CIKM’05, Bremen, Germany, pp.
517–524.
Mihalcea, R., Corley, C., and Strapparava, C. (2006). Corpus based and knowledge based
measures of text semantic similarity. In A. Cohn (ed.), Proceedings of the American
Association for Artificial Intelligence (AAAI 2006), Boston, Massachusetts, pp. 775–
780.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM,
38(11), 39–41.
Noah, S. A., Amruddin, A. Y., and Omar, N. (2007). Semantic similarity measures for Malay
sentences. In D. H-L Goh, T. H. Cao, I. Sølvberg, and E. M. Rasmussen (eds.),
Proceedings of the ICADL 2007, Hanoi, Vietnam, pp. 117–126.
Noor, N. K. M., Noah, S. A., Aziz, M. J. A. and Hamzah, M. P. (2012). Malay Anaphor and
Antecedent Candidate Identification: A Proposed Solution. In J-S. Pan, S-M. Chen, and N. T.
Nguyen (eds.), Proceedings of the Asian Conference on Intelligent Information and Database
Systems (ACIIDS) (3), Kaohsiung, Taiwan, pp. 141–151.
Othman, A. (1993). Pengakar perkataan melayu untuk sistem capaian dokumen [A Malay
word stemmer for a document retrieval system]. MSc Thesis. National University of
Malaysia, Bangi, Malaysia.
Qiu, G., Bu, J., Chen, C., Huang, P., and Cai, K. (2007). Syntactic impact on sentence
similarity measure in archive-based QA system. In J. Pei, V. S. Tseng, L. Cao, H.
Motoda, G. Xu (eds.), Proceedings of the 11th Asia Pacific Conference on Advances in
Knowledge Discovery and Data Mining, Gold Coast, Australia, pp. 769–776.
Salton, G., and Lesk, M. (1971). Computer evaluation of indexing and text processing.
Journal of the ACM, 15(1), 8–36.
Turney, P. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In L. D.
Raedt and P. A. Flach (eds.), Proceedings of the 12th European Conference on Machine
Learning, Freiburg, Germany, pp. 491–502.
Verberne, S., Boves, L., Oostdijk, N., and Coppen, P.-A. (2008). Evaluating paragraph
retrieval for why–QA. In C. Macdonald, I. Ounis, V. Plachouras, I. Ruthven, R. W.
White (eds.), Proceedings of the 30th European Conference on IR Research, ECIR 2008,
Glasgow, UK, pp. 669–673.
Zeng, H-J., He, Q-C., Chen, Z., Ma, W-Y., and Ma, J. (2004). Learning to cluster web search
results. In M. Sanderson, K. Järvelin, J. Allan, and P. Bruza (eds.), Proceedings of the
27th Annual International ACM SIGIR Conference, Sheffield, UK, pp. 210–217.
Zhang, Z. Q., Gentile, A. N., and Ciravegna, F. (2012). Recent advances in methods of lexical
semantic relatedness – a survey. Natural Language Engineering, 19(4), 411–479.
Figure 1. Distribution of word-to-word similarity measures for synonyms, probabilistic
method
Figure 2. Distribution of word-to-word similarity measures for synonyms, normalization
method
Table 1. Basic sentence patterns in Malay
Pattern (1) Noun Phrase (NP) Subject + Noun Phrase (NP) Predicate
Pattern (2) Noun Phrase (NP) Subject + Verb Phrase (VP) Predicate
Pattern (3) Noun Phrase (NP) Subject + Adjective Phrase (AP) Predicate
Pattern (4) Noun Phrase (NP) Subject + Prepositional Phrase (PP) Predicate
similarity)
Target sentence 2: Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.)

Candidate sentence | No. | Simv | Simo | Simv+o | Sims | Sims+o
Saya membaca buku sambil minum air kopi. (I read a book while drinking coffee.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Saya membaca buku sambil minum air teh. (I read a book while drinking tea.) | 2 | 0.76 | 0.75 | 0.76 | 0.87 | 0.81
Saya membelek majalah sambil minum air teh. (I skimmed a magazine while drinking tea.) | 3 | 0.39 | 0.67 | 0.53 | 0.68 | 0.62
Saya menonton televisyen sambil minum air teh. (I watched television while drinking tea.) | 4 | 0.41 | 0.67 | 0.54 | 0.67 | 0.61
Ahmad menonton televisyen sambil minum air teh. (Ahmad watched television while drinking tea.) | 5 | 0.34 | 0.70 | 0.52 | 0.44 | 0.57
Saya menonton televisyen sambil baring. (I lay down and watched television.) | 6 | 0.17 | 0.22 | 0.19 | 0.31 | 0.26

Target sentence 3: Komputer riba sangat ringan. (Laptops are very light.)

Candidate sentence | No. | Simv | Simo | Simv+o | Sims | Sims+o
Komputer riba sangat ringan. (Laptops are very light.) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Komputer riba amat ringan. (Laptops are extremely light.) | 2 | 0.95 | 0.58 | 0.76 | 0.83 | 0.70
Komputer riba sangat berat. (Laptops are very heavy.) | 3 | 0.91 | 1.00 | 0.95 | 0.82 | 0.91
Kalkulator kecil sangat ringan. (Small calculators are very light.) | 4 | 0.47 | 0.74 | 0.61 | 0.52 | 0.63
Mesin kira sangat ringan. (Calculating machines are very light.) | 5 | 0.71 | 0.80 | 0.75 | 0.70 | 0.75
Meja komputer amat berat. (Computer tables are very heavy.) | 6 | 0.39 | 0.61 | 0.50 | 0.33 | 0.47

Target sentence 4: Agensi kerajaan Malaysia (Malaysia government agency)

Candidate sentence | No. | Simv | Simo | Simv+o | Sims | Sims+o
Agensi kerajaan Malaysia (Malaysia government agency) | 1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Agensi kerajaan Cina (China government agency) | 2 | 0.81 | 0.31 | 0.56 | 0.69 | 0.50
Agensi negara Malaysia (Country of Malaysia agency) | 3 | 0.96 | 0.89 | 0.92 | 0.81 | 0.85
Agen negara asing (Foreign country agency) | 4 | 0.36 | 0.52 | 0.44 | 0.25 | 0.38
Agen kerajaan Malaysia (Malaysia government agent) | 5 | 0.49 | 0.73 | 0.61 | 0.67 | 0.70
Agensi bangsa Malaysia (Malaysian tribe agency) | 6 | 0.87 | 0.59 | 0.73 | 0.72 | 0.66
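The scores in the tables above come from the paper's own lexical measures. As a rough point of comparison only, a plain bag-of-words cosine baseline behaves the way the first two rows of each block suggest: identical sentences score 1.0, and a one-word substitution drops the score in proportion to sentence length. This is a minimal sketch of such a baseline, not the measures evaluated in the paper; the function name and whitespace tokenization are our own assumptions:

```python
from collections import Counter
import math

def bow_cosine(s1: str, s2: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences.

    Tokenization is naive (lowercase + whitespace split); real measures
    would stem Malay affixes and consult a lexical database for synonyms.
    """
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)       # shared-word weight
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Identical sentences score 1.0; swapping one word out of seven
# ("kopi" -> "teh") leaves six shared tokens, so the score is 6/7.
same = bow_cosine("Saya membaca buku sambil minum air kopi.",
                  "Saya membaca buku sambil minum air kopi.")
near = bow_cosine("Saya membaca buku sambil minum air kopi.",
                  "Saya membaca buku sambil minum air teh.")
```

Note that such a baseline assigns zero similarity to synonym pairs like "kerajaan"/"negara", which is precisely the gap the semantic measures in the tables are designed to close.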
Table 3: Selected initial testing results

Sentence pairs | Human similarity judgement | Simv | Simo | Simv+o | Sims | Sims+o
1. Saya main bola. (I play with the ball.) | 0.76 | 0.44 | 0.61 | 0.52 | 0.61 | 0.60