Publication record: https://www.researchgate.net/publication/328187206

Clustering news articles using efficient similarity measure and N-grams

Article in International Journal of Knowledge Engineering and Data Mining · January 2018
DOI: 10.1504/IJKEDM.2018.095525

Authors: Desmond Bala Bisandu (Cranfield University), Rajesh Prasad (African University of Science and Technology), Musa Liman (Universiti Putra Malaysia)


Int. J. Knowledge Engineering and Data Mining, Vol. 5, No. 4, 2018 333

Clustering news articles using efficient similarity measure and N-grams

Desmond Bala Bisandu, Rajesh Prasad* and Musa Muhammad Liman
School of IT and Computing,
American University of Nigeria,
Yola, Adamawa State, Nigeria
Email: desmond.bisandu@aun.edu.ng
Email: rajesh.prasad@aun.edu.ng
Email: muhammad.liman@aun.edu.ng
*Corresponding author

Abstract: The rapid progress of information technology and the web makes it easy to store huge amounts of collected textual information, e.g., blogs, news articles, e-mail messages, reviews and forum postings. The growing size of textual datasets, with high dimensionality and natural language content, poses a big challenge, making it hard to categorise such information efficiently. Document clustering is an automatic unsupervised machine learning technique aimed at grouping related items into clusters or subsets. The target is to create clusters with high internal coherence that differ substantially from each other. This paper presents a new document clustering technique using N-grams and an efficient similarity measure known as the 'improved sqrt-cosine similarity measure'. Comprehensive experiments are conducted to evaluate our proposed clustering technique and compare it with an existing method. The results of the experiments show that our proposed clustering technique outperforms the existing techniques.
Keywords: information retrieval; clustering; similarity measures; data mining;
N-grams.
Reference to this paper should be made as follows: Bisandu, D.B., Prasad, R.
and Liman, M.M. (2018) ‘Clustering news articles using efficient similarity
measure and N-grams’, Int. J. Knowledge Engineering and Data Mining,
Vol. 5, No. 4, pp.333–348.
Biographical notes: Desmond Bala Bisandu completed his MSc in Computer Science at the School of IT and Computing, American University of Nigeria, Yola, Adamawa State, Nigeria, and his BSc in Computer Science at the University of Jos, Plateau State, Nigeria, where he is currently working as an Assistant Lecturer in the Computer Science Department. His research interests include data mining, machine learning algorithms, artificial intelligence, bioinformatics and data compression.
Rajesh Prasad is currently working as an Assistant Professor at the School of IT
and Computing, American University of Nigeria, Yola. He received his MTech
in Software Engineering and PhD in Computer Science and Engineering from
the Motilal Nehru National Institute of Technology, Allahabad, India. He has
more than 16 years of teaching experience in various universities in India and
abroad. He has published more than 50 research papers in various journals
and conferences. He is a reviewer of various reputed journals. His research
areas include data mining, artificial intelligence, data compression, algorithms,
bioinformatics, etc.

Copyright © 2018 Inderscience Enterprises Ltd.



Musa Muhammad Liman completed his MSc in Computer Science from the
School of IT and Computing, American University of Nigeria, Yola, Adamawa
State, Nigeria. He received his BSc in Computer Science from the Al-Hikmah
University Ilorin, Kwara State, Nigeria. His research interests include data
mining, bioinformatics and data compression.

1 Introduction

The world we live in is full of data. Computers have become the accepted means of data storage: data can be saved easily and conveniently, anybody with access to a computer can do it and, more importantly, many users can share stored information or send it to different locations (Kriegel et al., 2007). As the number of text documents stored in large databases increases, understanding the hidden patterns or relationships inside the data becomes a huge challenge. Text data, not being in numerical format, can hardly be analysed directly using statistical methods. Information overload, or 'drowning in data', is a common complaint: people see the potential value of information, yet are frustrated by their inability to derive benefit from it due to its volume and complexity (Sowjanya and Shashi, 2010; Han et al., 2012).
Due to the rapid daily growth of online news articles, journals, books, research papers and web pages, the need to quickly find the most important, interesting, valuable or entertaining items has arisen, because we are overwhelmed by the increasing volume of information made available online (Bouras and Tsogkas, 2016; Rupnik et al., 2016). Throughout history, humans have used information to achieve great things, such as predicting the future to avoid disaster and to make vital decisions (Butler and Keselj, 2009; Jatowt and Au-Yeung, 2011; Biswas et al., 2014; Dhakar and Tiwari, 2014; Bouras and Tsogkas, 2016). Because the overloading of the internet with this huge amount of information makes searching very tedious for users, techniques that can efficiently and effectively derive profitable knowledge from this diverse, unstructured information are in high demand (Bouras and Tsogkas, 2013; Popovici et al., 2014; Lwin and Aye, 2017).
One of the most important ways to deal with data is to classify or group it into clusters or categories. Classification has played an important and indispensable role throughout human history (Wu et al., 2008; Brockmeier et al., 2018). There are two types of classification: supervised and unsupervised. Supervised classification requires available predefined knowledge, whereas unsupervised classification, sometimes referred to as clustering, needs no predefined labelled data (Agrawal et al., 1998; Tao et al., 2004).
Grouping similar data, such as news articles, based on their characteristics is an important issue. Grouping can be done on the basis of some similarity measure. Several similarity measures (such as gauging, Jaccard, Euclidean, edit and cosine) have been proposed and applied to computing the similarity between two textual documents based on character matching, word semantics and word sense (Damashek, 1995; Huang, 2008; Qiujun, 2010; Svadasa and Jhab, 2014; Jayashree et al., 2014; Akinwale and Niewiadomski, 2015; Sonia, 2016; Huang et al., 2017). The rationale behind every method of measuring the similarity between two textual documents is the continuing quest to improve the quality and effectiveness of the existing clustering or filtering techniques (Shah and Mahajan, 2012; Sonia, 2016; Singh et al., 2017).
Bouras and Tsogkas (2016) proposed a clustering technique that uses a traditional similarity measure with N-grams, while Sohangir and Wang (2017a) proposed an efficient similarity measure known as the 'improved sqrt-cosine similarity measurement' but did not test its suitability with the N-grams data representation technique.
In this paper, we propose a technique for clustering news articles using an efficient similarity measure known as the 'improved sqrt-cosine similarity measurement' and word-based N-grams. An N-gram is a sequence of characters or words within a window over the body of text in a document; the window size is determined by the chosen number of grams. An experiment in the R programming environment has been conducted to check the accuracy and purity of the proposed clustering technique on the Reuters-21578 and 20Newsgroups datasets, respectively. The accuracy and purity of the proposed clustering technique are recorded for different values of N, and the best N-gram result is compared with the result of the baseline technique.
The rest of the paper is organised as follows. Section 2 presents related concepts. Sections 3 and 4 describe the methodology, the experimental results and the comparison with the baseline clustering technique designed by Sohangir and Wang (2017a), while Section 5 presents the conclusion and future work.

2 Related concepts

News article clustering is a wide area of research with a long history that includes several tasks, ranging from segmenting events of news streams to tracking and detecting events (Damashek, 1995; Kyle et al., 2012). Clustering techniques or methods are proposed based on some form of document representation, similarity measure and machine learning algorithm (Mihalcea and Tarau, 2005; Saini, 2018). Research on news article clustering techniques can be broadly categorised into two groups: word-based clustering techniques and N-grams-based clustering techniques (Shafiei et al., 2006; Qiujun, 2010; Ifrim et al., 2014; Rupnik et al., 2016). Formerly, the focus was on clustering related news articles or documents, with the clusters serving as the basis for extracting the information needed (Toda and Kataoka, 2005; Nyman et al., 2018). More recent clustering designs find hidden features and cluster those features to identify events in a news article (Miao et al., 2005; Shah and Mahajan, 2012; Mele and Crestani, 2017).
In the last few decades, research has focused on improving the efficiency of news clustering techniques. The challenge is that traditional approaches, designed with language dependencies and using traditional similarity measures, cannot perform efficiently as the number of news reports in different languages increases; many of these reports are short, noisy and published at very high speed (Tao et al., 2018). One of the main problems in news article clustering is the ability to perform efficiently on any kind of news article, regardless of the language in which the articles or documents are presented.
Miao et al. (2005) proposed a document clustering method using character N-grams and compared the resulting clusters with term- and word-based clusters; the technique applied character N-grams to build a feature document frequency (DF × IDF) scheme. Their experimental results show that using character N-grams gives the best clustering result. Toda and Kataoka (2005) proposed a method for clustering news articles that addresses the problem of retrieving information from information retrieval systems, using named entity extraction and terms, and finally labelling the classes of the documents from the term list. Their technique finds the maximum set of terms as features representing the categories of the news within a specific time window. They identify the most frequent term features and list them within a window; the terms are then grouped and an analysis technique is applied to determine the most frequent terms. The extraction of these terms may result in a very large number of terms, especially if pre-processing methods are not applied. Moreover, describing the detected terms in the news using a single word set may not be intuitive and can be very difficult for humans to interpret.
In Newman et al. (2006), the authors present an approach to analysing entities and topics in a news article using statistical topic models; this approach only considers how topics can be generated from a news article and does not consider categorising the news articles into clusters or describing them. In Ikeda et al. (2006), a technique that can automatically link blog entries with related news articles was proposed, using the vector space model and cosine similarity to calculate the distance between the blog and the news without knowing the category of the news article. However, this method does not apply any clustering technique to determine how many news items belong to which type of news, and it has been shown to be less efficient, even though the authors tried to improve its effectiveness using an intuitive weighting method (Naughton et al., 2006).
Huang (2008) conducted an evaluation of clustering techniques based on text similarity measures and confirmed that Euclidean distance performed worst in clustering, making any clustering technique that used it less effective. Parapar and Barreiro (2009) concluded from their experiments on various clustering algorithms that using N-grams with any clustering algorithm helps increase its effectiveness. They also proposed an approach to reduce the computational load of existing clustering algorithms by using a fingerprint method to trim the size of the documents before clustering; their approach performed very well with respect to saving memory and computation time. Karol and Mangat (2013) used particle swarm optimisation to evaluate their proposed clustering technique, which was designed using the cosine similarity measure; they affirmed that the method of document representation also affects the quality of a clustering technique. A similar problem was addressed by Bouras and Tsogkas (2010), who investigated a broad spectrum of clustering algorithms, as well as distance measures, by designing a news article clustering technique that makes use of three different similarity measures. Their experimental results show that, despite the simplicity of the k-means algorithm, applying the right pre-processing methods increases the efficiency of the clustering technique. Analogously, Qiujun (2010) proposed an approach for extracting news content based on twin pages with the same features (specifically noisy similarity). The similarity measure applied is based on edit distance, chosen for its simplicity despite its fairly high complexity. This technique was designed to check the appropriateness of applying text cleaning techniques to unstructured data from different web pages before clustering.
Park et al. (2011) proposed a news article clustering technique for contrasting contentious issues in news articles from opposing sides, based on word features, using disputant-relation similarity with the issues at hand. This technique used word-based representation with the HITS algorithm to calculate the similarities between different discourses. Li et al. (2011) proposed a two-stage scalable personalised news recommendation clustering technique based on intrinsic user interest; this technique uses hierarchy and topic detection with cosine similarity to calculate the distances between user interests before categorising. Bouras and Tsogkas (2012) proposed a news article clustering technique using keywords and WordNet, applying cosine similarity to calculate the distances between the keywords and finally clustering with weighted k-means. Subsequently, Bouras and Tsogkas (2013) proposed a method based on word-based N-gram techniques and the 'bag of words', with WordNet used to enrich the N-gram word list, clustering with the core k-means process and the Wk-means extension of k-means. This method was implemented on an N-gram-based clustering system without considering improvement of the similarity measure. The performance of the technique was measured using the clustering index (CI) with k-means and the previously proposed Wk-means algorithm. Although this research validated the improvement brought by N-gram-based data representation, it failed to check whether improving the similarity measure used on the news articles can also improve the coherence of the news article clusters. Qian and Zhai (2014) proposed a multi-view clustering technique for selecting features in an unsupervised way for text-image web news data, where images are learned from a local orthogonal non-negative matrix factorisation for labelling. However, this technique was designed on the basis of views of a particular image. Analogously, Xia et al. (2015) proposed a clustering technique for social news using a topic model known as discriminative bi-term, which excludes less indicative bi-terms by discriminating topical terms in general and specific documents. This technique, however, is language dependent because of the discrimination attached to the specific document, which makes it inflexible.
Among other recent and popular techniques for clustering news articles and textual documents, Bouras and Tsogkas (2016) designed a document clustering system that helps solve the new-user problem based on the WordNet database and minimal user ratings. This system is implemented using word-based N-grams; it fetches articles from the database and makes recommendations to new users. The results of the experiment show that changing the value of 'n' has a great impact on the clustering.
Rupnik et al. (2016) designed a method that can track events written in different languages and can also compare articles across languages to make predictions of events. This was implemented using document similarity measures based on cross-language Wikipedia. The method was implemented in a multi-language system with semantic-based feature selection using a probabilistic cosine similarity measure.
Lwin and Aye (2017) proposed a method for document clustering systems using hierarchical clustering based on the number of occurrences of word representations in the dataset rather than on the frequency of the items; the Jaccard similarity measure was used for calculating the similarity between the documents. Sohangir and Wang (2017a, 2017b) proposed a similarity measure based on the Hellinger distance known as the 'improved sqrt-cosine similarity measurement'. This similarity measure was tested on different datasets and compared with other existing similarity measures on clustering high-dimensional textual documents, and it was proven to be more robust in contributing to the quality of the clusters. However, this measure was tested only on the 'bag of words', not on sequence-of-character or sequence-of-word representations of documents such as N-grams.
Santhiya and Bhuvaneswari (2018) designed a clustering technique and implemented it in a system applying the MapReduce framework for the classification of crime in news articles using MongoDB. However, the authors indicated that there is a need to design a technique that can automatically categorise crime from different sources irrespective of the language used to present the news.

3 Methodology

The proposed clustering technique consists of the following steps: news article pre-processing, news article representation using N-grams, vector space modelling of the news articles, dimensionality reduction using a threshold on the feature vector, and the improved sqrt-cosine similarity measurement. Finally, k-means clustering is applied to the obtained vectors and clusters of news articles are obtained. At the end, these clusters of news articles are evaluated with a view to discovering knowledge.

3.1 Clustering technique


Figure 1 shows all the steps of the proposed technique; the implemented steps are explained in the sub-sections that follow.

Figure 1 Proposed methodology (clustering technique)



3.1.1 News articles pre-processing


Clustering textual documents such as news articles is one of the important tasks in text mining because the data we are dealing with is unstructured in nature: a huge amount of unstructured or semi-structured data is stored in text format, such as blogs, research papers, books, web pages, e-mail messages, XML documents, news articles, etc. To obtain a structured format from the unstructured data, a sequence of operations is required to convert the text to the desired format before any further processing. Some of these operations are stop-word removal, lemmatisation, stemming and white-space removal, which are mostly applied for the traditional 'bag of words' method of document representation. The pre-processing technique applied in this research is different from the traditional methods, because N-grams have a special pre-processing method that converts any non-letter character, or any multiple spaces due to errors anywhere in the document, into a single space; moreover, stop words do not affect the final clustering result as they do in the 'bag of words' document representation method. For instance, given the articles {"A1: I am here, A2: I won't, A3: I am a boy"}, the pre-processing result will be {"A1: I_am_here, A2: I_won_t, A3: I_am_a_boy"}. The advantage of this method is that it skips all the other pre-processing stages of the traditional 'bag of words' representation, thereby saving computational time (Lebret and Collobert, 2014).
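As an illustrative sketch, the pre-processing described above can be written as follows. This is Python (the paper's experiments were run in R), and the function name `preprocess` is our own choice, not from the paper: non-letter characters become spaces, runs of spaces collapse to one, and the surviving tokens are joined with underscores, reproducing the paper's example.

```python
import re

def preprocess(text):
    # Replace every non-letter character (punctuation, digits, etc.) with a space
    text = re.sub(r"[^A-Za-z]", " ", text)
    # Collapse any run of spaces caused by errors into a single space and trim
    text = re.sub(r" +", " ", text).strip()
    # Join the remaining tokens with underscores, as in the paper's example
    return text.replace(" ", "_")

print(preprocess("I won't"))     # I_won_t
print(preprocess("I am a boy"))  # I_am_a_boy
```

Note that the apostrophe in "won't" is simply treated as any other non-letter, which is how the paper's example arrives at "I_won_t".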

3.1.2 News articles N-grams representation


The N-gram representation of a news article is a sequence of 'textual units' in 'n' adjacencies (a sequence of text of length 'n') extracted from the article. The context of interest determines whether a 'textual unit' is a character, a word or a byte. In this research, we use word-based N-grams, conceptualised as a small sliding window placed over a given text sequence, where the only visible words are the ones within the 'n'-word window at a given time (Bouras and Tsogkas, 2013; Mamaysky and Glasserman, 2017). At each window position, only the words within it are recorded. Some schemes allow the window to slide by more than a single word for each N-gram.

3.1.2.1 Representation with 1-gram


The simplest N-gram, with n = 1, is called the unigram, and it falls back to the traditional 'bag of words'. This is the typical practice for representing news articles as a 'bag of words': the vector space model dimensions comprise the documents and the 'bag of words' dimensions. To extract words from the textual document, the text processing of the traditional 'bag of words' is applied. For instance, given the articles {"A1: I am here, A2: I won't, A3: I am a boy"}, the 1-gram representation will be {"A1: 'I' 'am' 'here', A2: 'I' 'won't', A3: 'I' 'am' 'a' 'boy'"}.

3.1.2.2 Representation with N-grams


N-grams are long strings of symbols extracted from the documents. The sequence can be in the form of characters, words or bytes. If words are taken as the sequence units, text semantics is captured better. Thus, we consider word-based N-grams: collections of adjacent words, from which bi-grams, tri-grams, etc. are obtained. For the N-gram method of representing news articles, the removal of stop words is not needed, and other pre-processing such as stemming is not needed either. Thus, the use of N-grams helps in ignoring any grammatical or typographical errors in the articles. For instance, given the articles {"A1: I am here, A2: I won't, A3: I am a boy"}, the character 4-gram representation will be {A1: I_am, _am_, am_h, m_he, _her, here, A2: I_wo, _won, won_, on_t, A3: I_am, _am_, am_a, m_a_, _a_b, a_bo, _boy}.
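The sliding-window extraction for both character and word N-grams can be sketched as follows (Python for illustration; the function names are ours). The character 4-gram call reproduces the A1 example above.

```python
def char_ngrams(text, n):
    # Slide a window of n characters over the pre-processed text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    # Slide a window of n words over the token list (bi-grams for n = 2, etc.)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("I_am_here", 4))
# ['I_am', '_am_', 'am_h', 'm_he', '_her', 'here']
print(word_ngrams(["I", "am", "a", "boy"], 2))
# [('I', 'am'), ('am', 'a'), ('a', 'boy')]
```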

3.1.3 Weight normalisation


The generated N-grams need to be weighted in order to normalise the frequency of the respective N-grams. The most used weighting method in text mining and information retrieval is term frequency-inverse document frequency (TF-IDF), which gives a measure of the number of relevant terms in the collected documents (Sohangir and Wang, 2017a). TF denotes the number of times an N-gram sequence occurs in a given document, while the document frequency, i.e., the count of documents that contain the significant sequence, is the quantity whose inverse gives the IDF. The TF is calculated using equation (1).
$$tf_{i,j} = \frac{m_{i,j}}{\sum_{k} m_{k,j}} \qquad (1)$$

where $m_{i,j}$ represents the number of times the relevant term sequence $t_i$ appears in document $d_j$, and the denominator is the total number of term occurrences in the whole document $d_j$.
We use the weighting formula in equation (2) because of its simplicity and yet more accurate results (Sohangir and Wang, 2017a):

$$w_{t,d} = \begin{cases} 1 + \log_{10} tf_{t,d} & \text{if } tf_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $w_{t,d}$ is the normalised log-frequency weight and $tf_{t,d}$ is the TF of the sequence in document $d_j$.
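Equations (1) and (2) can be sketched as follows (an illustrative Python version; the paper's experiments use R, and the function names here are ours).

```python
import math

def term_frequency(counts):
    # Equation (1): tf_{i,j} = m_{i,j} / sum_k m_{k,j},
    # where counts maps each term sequence to its raw count m_{i,j}
    total = sum(counts.values())
    return {term: m / total for term, m in counts.items()}

def log_tf_weight(tf):
    # Equation (2): w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

print(term_frequency({"I_am": 2, "am_a": 2}))  # {'I_am': 0.5, 'am_a': 0.5}
print(log_tf_weight(10))                       # 2.0
```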

3.1.4 Vector space model


An appropriate numerical model is needed to represent the text document dataset so that the clustering algorithm can process it. The vector space model, together with a suitable term-weighting technique, is the most appropriate numerical model. The log term frequency (LogTF) weighting scheme is used in this research as the weight of a sequence of terms. The log-frequency weight is also normalised to unity, i.e., $w_{t,d} \le 1$. The LogTF takes the log of the TF on the condition that the TF is greater than zero.

3.1.5 Dimensionality reduction on vector features


Dimensionality reduction is needed because short text documents form many N-grams, which are supposed to be in the same dimension before the similarity measure is applied, and a huge number of features are generated from the N-gram-based representation model. The reduction is necessary for the sake of computational time complexity. During news article clustering, only the dimensions of the feature vector are reduced, meaning that only the number of features used for the clustering is reduced. A threshold is applied to the $w_{t,d}$ values of the vector space model in order to select these features: the N-grams with the highest total $w_{t,d}$ weight in the news article collection are selected as features for the clustering. In this way, 50% of the dimensions are successfully removed.
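A minimal sketch of the 50% feature-reduction step, assuming the $w_{t,d}$ weights have already been summed per N-gram over the whole collection (Python for illustration; the function name and dictionary layout are our own assumptions).

```python
def select_features(total_weights, keep_fraction=0.5):
    # total_weights maps each N-gram to its total w_{t,d} over the collection.
    # Keep the top fraction of N-grams by weight (50% in this research).
    ranked = sorted(total_weights, key=total_weights.get, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:keep])

weights = {"I_am": 3.0, "am_a": 1.0, "a_boy": 2.5, "won_t": 0.5}
print(sorted(select_features(weights)))  # ['I_am', 'a_boy']
```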

3.1.6 Improved sqrt-cosine similarity measure


High-dimensional data is very common in information retrieval, but Euclidean distances are problematic in such spaces (Sohangir and Wang, 2017a). In machine learning, when dealing with high-dimensional data, Euclidean distances are rarely considered effective measurements. Aggarwal (2012) proved, from both empirical and theoretical perspectives, that the Euclidean norm is not an effective metric for data mining applications on high-dimensional data. Because of distance concentration in high-dimensional spaces, the ratio of the distances to the farthest and nearest neighbours of a given target is almost one, so the distances to different data points show little variation. Aggarwal (2012) also investigated the behaviour of Euclidean norm measures in high-dimensional space and found that lower-dimensional settings tend to perform better. In other words, for applications on high-dimensional data, distances such as the Hellinger distance are favoured over Euclidean distances. Therefore, in this research, equation (3) gives the similarity measurement used to calculate the distances between the normalised vectors of the news articles.

$$ISC(p, q) = \frac{\sum_{i=1}^{n} \sqrt{p_i q_i}}{\sqrt{\sum_{i=1}^{n} p_i}\,\sqrt{\sum_{i=1}^{n} q_i}} \qquad (3)$$

where each document is normalised and the square root of its normalised form, that is, $\sqrt{\sum_{i=1}^{n} p_i}$, is used; $p_i$ is the first document in normalised form, $q_i$ is the second document in normalised form and $ISC(p, q)$ is their similarity measure. Equation (3) is known as the efficient similarity measure. It has been chosen because it has been proven to be effective compared with other state-of-the-art similarity measures for textual document clustering on word-based document representations, but it has not been tested with the N-gram-based document representation (Sohangir and Wang, 2017a).
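Equation (3) can be sketched directly (Python, illustrative; it assumes the weight vectors are non-negative, as the square roots require, and the function name is ours).

```python
import math

def isc(p, q):
    # Improved sqrt-cosine similarity, equation (3); p and q are
    # non-negative weight vectors of equal length.
    num = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    den = math.sqrt(sum(p)) * math.sqrt(sum(q))
    return num / den if den else 0.0

print(round(isc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), 6))  # 1.0
```

Identical vectors score 1, and vectors with no shared non-zero dimensions score 0, mirroring the [0, 1] range of cosine-style similarities.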

3.1.7 K-means clustering


k-means is considered one of the simplest unsupervised machine learning algorithms for grouping similar objects. It was designed by MacQueen (1967) and thereafter by Hartigan and Wong (1975) (Han et al., 2012). News article clustering groups the news collection into different groups such that news items in the same group share common features. k-means clustering is used here for clustering the news articles; the algorithm automatically partitions the news into k different clusters. To determine the value of k for the clustering, the sum of squared errors method is used. In this research, we use the k value for which the within sum of squares (WSS) is smallest, as generated from our experiment with the elbow method illustrated in Figure 2; this adheres to the main goal of clustering, which is reducing the WSS distance of the clusters. Figure 2 shows the plot of the within sum of squares on the two datasets using the elbow method; it is clear from the plot that the best number of clusters is three, because both datasets form the first elbow at k = 3 (Lebret and Collobert, 2014; Singh et al., 2017).

Figure 2 Graph between within sum of squares and different values of k to determine the best
number of clusters
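The WSS quantity plotted in Figure 2 can be sketched as follows, given a clustering assignment (Python, illustrative; in a real elbow run this would be computed once per candidate k after fitting k-means, and the names here are ours).

```python
def wss(points, centroids, assignment):
    # Within sum of squares: squared distance of each point to the
    # centroid of the cluster it is assigned to, summed over all points.
    return sum(
        sum((x - c) ** 2 for x, c in zip(points[i], centroids[assignment[i]]))
        for i in range(len(points))
    )

pts = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
cents = [(0.0, 1.0), (4.0, 0.0)]
print(wss(pts, cents, [0, 0, 1]))  # 2.0
```

The elbow is then read off a plot of this value against k: the first k at which the curve flattens (k = 3 for both datasets here) is chosen.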

4 Experimental results

The experiments in this paper were implemented in the R programming environment and conducted on two datasets from different application domains with the k-means clustering model. Table 1 presents the datasets. We used these sets because they are commonly used as benchmarks for validating and testing text clustering techniques, classification and other machine learning algorithms. A more elaborate description of the used sets is as follows:

1 Reuters-21578 is a collection of documents that appeared on the Reuters newswire since 1987. R8 and R52 are text categorisation subsets of the Reuters-21578 data. Reuters personnel labelled the contents after collecting the document set.

2 20Newsgroups is a collection of about 20 different newsgroups containing around 20,000 newsgroup documents; it is one of the most used datasets in text mining (Sohangir and Wang, 2017a).

Table 1 Real-world datasets summary

Datasets        No. of samples    No. of dimensions    No. of classes
Reuters         2,900             1,000                8-52
20Newsgroups    11,293            1,000                20

4.1 Performance metrics


To measure the performance of the proposed technique in the experiments, we use
two different performance metrics, accuracy and purity, both bounded in [0, 1]
(Sohangir and Wang, 2017a, 2017b). The results on different windows of N-grams are
recorded in order to determine which window performs best, and the best clustering
result is then compared with a baseline clustering technique.
Accuracy is measured in terms of F-measure, which considers both the precision and the
recall of the clustering result, as shown in the proposed framework in Figure 1. A higher
value of F-measure means a better clustering technique. Accuracy can be calculated by
equation (4).

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (4)

where precision measures how specifically an object is clustered with respect to the
original class and can be calculated by equation (5).

Precision = (number of common pairs in the original class and the clustered result, tp) / (number of pairs in the clustering result, tp + fp)    (5)

Recall measures how well an object can be retrieved from the clustered result and can
be calculated by equation (6).

Recall = (number of common pairs in the original class and the clustered result, tp) / (number of pairs in the original class, tp + fn)    (6)
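Equations (4)–(6) are defined over pairs of documents: a pair counts as tp when the two documents share both a class and a cluster. A small Python sketch with toy labels (illustrative only; the paper's pipeline was built in R) makes the bookkeeping explicit:

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count pairwise tp / fp / fn between a reference labelling and a clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1       # pair together in both the class and the cluster
        elif same_pred:
            fp += 1       # clustered together but from different classes
        elif same_true:
            fn += 1       # same class but split across clusters
    return tp, fp, fn

true_labels = ['a', 'a', 'a', 'b', 'b', 'c']
pred_labels = [0, 0, 1, 1, 2, 2]
tp, fp, fn = pair_counts(true_labels, pred_labels)
precision = tp / (tp + fp)                                  # equation (5)
recall = tp / (tp + fn)                                     # equation (6)
f_measure = 2 * precision * recall / (precision + recall)   # equation (4)
# here tp = 1, fp = 2, fn = 3, so F-measure = 2/7 ≈ 0.286
```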

Purity measures the quality of a single cluster Cj with respect to the original classes,
where pij is the number of objects of class i in cluster j. A higher purity value means a
better clustering technique. Purity can be calculated by equation (7).

Purity(Cj) = (1 / |Cj|) max_j {p_ij}    (7)

where Cj represents a single cluster among the clusters generated from the original
classes of documents, |Cj| is its size, and max_j {p_ij} is the largest number of objects
common to both the original class and the single cluster under consideration.
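Equation (7) scores one cluster by its largest class overlap; summing those maxima over all clusters and dividing by the number of documents gives the overall purity reported below. A toy Python sketch (not the paper's R code) shows the idea:

```python
from collections import Counter

def purity(true_labels, pred_labels):
    """Overall purity: each cluster is credited with its majority class."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, []).append(t)
    # per cluster, take the size of the largest class overlap (the max in eq. 7)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(true_labels)

true_labels = ['a', 'a', 'a', 'b', 'b', 'c']
pred_labels = [0, 0, 1, 1, 2, 2]
overall = purity(true_labels, pred_labels)
# the three clusters contribute majority counts of 2, 1 and 1 out of 6 → 2/3
```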

4.2 Results of experiment


The results of our experiments are provided in this section, where we check the accuracy
and purity for different values of N-grams. As a first step, we focus only on the accuracy
and purity of the proposed clustering technique with different windows of N-grams, in
order to observe which window gives the best accuracy and purity on both datasets. As a
second step, we take the best results among the various windows and compare them with
the results of the baseline clustering technique, so that the performance of the proposed
technique can be checked against the baseline results.

4.3 Overall results


First, the average performance of the proposed technique across all datasets with
different windows of N-grams is compared. Table 2 provides the average performance
metrics across all N-grams, from one-gram to bi-grams, tri-grams and quad-grams,
respectively. According to the average values in Table 2, the tri-gram results are the best,
followed by bi-gram and one-gram, while quad-gram has the poorest results; this
indicates that as the gram size grows beyond tri-grams, both purity and accuracy drop on
both datasets.
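The N-gram windows compared in Table 2 are contiguous word sequences extracted from each document. A minimal extractor (illustrative Python, not the paper's R implementation; the example sentence is invented) looks like:

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of a document as joined strings."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "Stocks rally as markets rebound"
word_ngrams(doc, 3)
# → ['stocks rally as', 'rally as markets', 'as markets rebound']
```

Documents shorter than the window simply yield no grams, which is one reason very wide windows (quad-grams and beyond) produce sparser representations and, as Table 2 shows, weaker clusters.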
Table 2 Average performance of proposed clustering technique with different windows of
N-grams on all datasets

N-grams	Datasets	Accuracy	Purity
1	Reuters	0.3135	0.5521
	20Newsgroups	0.3001	0.5600
2	Reuters	0.3435	0.6021
	20Newsgroups	0.3201	0.5900
3	Reuters	0.3950	0.9418
	20Newsgroups	0.3801	0.9200
4	20Newsgroups	0.2950	0.4418
	Reuters	0.2701	0.4100

Figure 3 Line chart showing accuracy of proposed technique on different N-grams (see online
version for colours)

According to the average performance across the two datasets in Table 2, the number of
grams used in generating the clusters affects the effectiveness of the clustering
technique. Clearly, the tri-grams outperform the other N-grams on both datasets, while
quad-grams have the poorest results, which shows that the N-gram window used to
generate the cluster solution should be carefully selected. Figure 3 and Figure 4 show the
plots of accuracy and purity of the proposed clustering technique for different values of
N-grams on all datasets, respectively.

Figure 4 Line chart showing purity of proposed technique on different N-grams (see online
version for colours)

4.4 Results comparison


Table 3 shows the comparison between the best results, generated by tri-grams on the
proposed clustering technique, and the baseline clustering technique on all datasets. The
results clearly indicate that our proposed clustering technique performs better than the
baseline clustering technique (Sohangir and Wang, 2017a).
Table 3 Average performance of the proposed technique as compared to baseline technique on
all datasets

Techniques	Datasets	Accuracy	Purity
Proposed clustering technique	Reuters	0.3950	0.9418
	20Newsgroups	0.3801	0.9200
Baseline clustering technique	Reuters	0.2320	0.5769
	20Newsgroups	0.1659	0.4234

5 Conclusions and future works

Finding an efficient and effective technique for clustering textual documents such as
news articles is a critical and challenging problem in information retrieval. Most
clustering techniques are designed around word frequency with similarity measures such
as cosine, which is based on the Euclidean measure. This has been useful in many
applications; however, word-based clustering is not ideal, because it is language
dependent and does not work well with high-dimensional, multi-lingual data. In this
paper, we proposed a new clustering technique based on N-grams. Comprehensive
experiments were conducted at different windows of the N-grams, from one-gram to
quad-grams, to check the effect of changing the N-gram window on the proposed
clustering technique. We compared the performance of all the grams from one-gram to
quad-grams in order to determine which gives the best results on all the datasets in
various document understanding tasks. The experiments show that, although our
proposed clustering technique uses the same similarity measure as the baseline clustering
technique, the N-grams have a greater impact on the final accuracy and purity of the
generated clusters, which helps our proposed clustering technique to perform favourably
compared to other clustering techniques on high-dimensional data.
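To make this concrete, the sketch below reduces documents to character tri-gram count profiles and compares them by cosine similarity; it is an illustrative Python fragment (the paper's implementation was built in R, and the sentences are invented examples), showing why related news articles score higher than unrelated ones:

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram counts: a language-independent document representation."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse n-gram count profiles."""
    dot = sum(c * q.get(g, 0) for g, c in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

a = ngram_profile("Markets rally on strong earnings")
b = ngram_profile("Strong earnings lift the markets")
c = ngram_profile("Heavy rain floods the valley")
# related articles share many more tri-grams: cosine(a, b) > cosine(a, c)
```

Because the profiles are built from characters rather than whole words, the same machinery applies unchanged across languages, which is the property the N-gram representation exploits.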
The following points can be considered in the future work:
1 Different clustering algorithms can be applied to check their performance on the
proposed technique.
2 Each cluster of article categories can be further classified into predefined labelled
classes and sub-classes.

References
Aggarwal, C.C. (Ed.) (2012) Mining Text Data, Springer, New York, NY.
Agrawal, R., Dimitrios, G. and Frank, L. (1998) ‘Mining process models from workflow logs’, in
International Conference on Extending Database Technology, pp.467–483, Springer, Berlin,
Heidelberg.
Akinwale, A. and Niewiadomski, A. (2015) ‘Efficient similarity measures for texts matching’,
Journal of Applied Computer Science, Vol. 23, No. 1, pp.7–28.
Biswas, S.K., Sinha, N., Baruah, B. and Purkayastha, B. (2014) ‘Intelligent decision support system
of swine flu prediction using novel case classification algorithm’, International Journal of
Knowledge Engineering and Data Mining, Vol. 3, No. 1, pp.1–19.
Bouras, C. and Tsogkas, V. (2010) ‘Assigning web news to clusters’, in 2010 Fifth International
Conference on Internet and Web Applications and Services (ICIW), IEEE, Vol. 12, pp.1–6.
Bouras, C. and Tsogkas, V. (2012) ‘A clustering technique for news articles using WordNet’,
Knowledge-Based Systems, Vol. 36, No. 2, pp.115–128.
Bouras, C. and Tsogkas, V. (2013) ‘Enhancing news articles clustering using word n-grams’,
in DATA, pp.53–60.
Bouras, C. and Tsogkas, V. (2016) ‘Assisting cluster coherency via n-grams and clustering as a tool
to deal with the new user problem’, International Journal of Machine Learning and
Cybernetics, Vol. 7, No. 2, pp.171–184.
Brockmeier, A.J., Mu, T., Ananiadou, S. and Goulermas, J.Y. (2018) ‘Self-tuned descriptive
document clustering using a predictive network’, IEEE Transactions on Knowledge and Data
Engineering, Vol. 12, No. 2, pp.1–14.
Butler, M. and Keselj, V. (2009) ‘Financial forecasting using character n-gram analysis and
readability scores of annual reports’, in Canadian Conference on AI, Springer, pp.39–51.
Damashek, M. (1995) ‘Gauging similarity with n-grams: language-independent categorization of
text’, Science, New Series, Vol. 267, No. 5199, pp.843–848.
Dhakar, M. and Tiwari, A. (2014) ‘Tree-augmented naïve Bayes-based model for intrusion
detection system’, International Journal of Knowledge Engineering and Data Mining, Vol. 3,
No. 1, pp.20–30.
Han, J., Kamber, M. and Pei, J. (2012) Data Mining: Concepts and Techniques, 3rd ed., Morgan
Kaufmann, Elsevier, Amsterdam.
Huang, A. (2008) ‘Similarity measures for text document clustering’, in Proceedings of the Sixth
New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch,
New Zealand, pp.49–56.
Huang, Z., Yi-Fei, W. and Mei, S. (2017) ‘Visual word based similar image retrieval optimization
by hamming distance’, Sciencepaper Online, Vol. 12, No. 2, pp.1–12.
Ifrim, G., Shi, B. and Brigadir, I. (2014) ‘Event detection in twitter using aggressive filtering and
hierarchical tweet clustering’, in Second Workshop on Social News on the Web (SNOW), ACM
Press, Seoul, Korea, 8 April, pp.1–34.

Ikeda, D., Fujiki, T. and Okumura, M. (2006) ‘Automatically linking news articles to blog entries’,
in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, ACM Press,
Vol. 23, pp.78–82.
Jatowt, A. and Au-Yeung, C. (2011) ‘Extracting collective expectations about the future from large
text collections’, in Proceedings of the 20th ACM International Conference on Information
and Knowledge Management, ACM Press, pp.1259–1264.
Jayashree, R., Murthy, K.S. and Anami, B.S. (2014) ‘Hybrid methodologies for summarisation of
Kannada language text documents’, International Journal of Knowledge Engineering and
Data Mining, Vol. 3, No. 1, pp.82–114.
Karol, S. and Mangat, V. (2013) ‘Evaluation of text document clustering approach based on
particle swarm optimization’, Open Computer Science, Vol. 3, No. 2, pp.69–90.
Kriegel, H-P., Karsten, M., Borgwardt, P.K., Alexey, P., Arthur, Z. et al. (2007) ‘Future trends in
data mining’, Data Mining and Knowledge Discovery, Vol. 15, No. 1, pp.87–97.
Kyle, A., Obizhaeva, A., Sinha, N. and Tuzun, T. (2012) ‘News articles and the invariance
hypothesis’, CEFRN, Vol. 34, No. 3, pp.1–44.
Lebret, R. and Collobert, R. (2014) N-gram-based Low-dimensional Representation for Document
Classification, Vol. 10, No. 12, pp.1–8, ArXiv preprint ArXiv, 1412.6277.
Li, L., Wang, D., Li, T., Knox, D. and Padmanabhan, B. (2011) ‘SCENE: a scalable two-stage
personalized news recommendation system’, in Proceedings of the 34th International ACM
SIGIR Conference on Research and Development in Information Retrieval, ACM Press,
Vol. 23, pp.125–134.
Lwin, M.T. and Aye, M.M. (2017) ‘A modified hierarchical agglomerative approach for efficient
document clustering system’, American Scientific Research Journal for Engineering,
Technology, and Sciences (ASRJETS), Vol. 29, No. 1, pp.228–238.
Mamaysky, H. and Glasserman, P. (2017) Does Unusual News Forecast Market Stress, Working
Papers 16-04, Office of Financial Research, US Department of the Treasury.
Mele, I. and Crestani, F. (2017) ‘Event detection for heterogeneous news streams’, in International
Conference on Applications of Natural Language to Information Systems, Springer, Vol. 34,
pp.110–123.
Miao, Y., Kešelj, V. and Milios, E. (2005) ‘Document clustering using character n-grams:
a comparative evaluation with term-based and word-based clustering’, in Proceedings of the
14th ACM International Conference on Information and Knowledge Management, ACM
Press, pp.357–358.
Mihalcea, R. and Tarau, P. (2005) ‘A language independent algorithm for single and multiple
document summarization’, in Proceedings of IJCNLP, ACM Press, Vol. 5, pp.12–20.
Naughton, M., Kushmerick, N. and Carthy, J. (2006) ‘Clustering sentences for discovering events
in news articles’, in European Conference on Information Retrieval, Springer, Vol. 22,
pp.535–538.
Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M. (2006) ‘Analyzing entities and topics
in news articles using statistical topic models’, in International Conference on Intelligence
and Security Informatics, Springer, Vol. 24, pp.93–104.
Nyman, R., Kapadia, S., Tuckett, D., Gregory, D., Ormerod, P. and Smith, R. (2018) News and
Narratives in Financial Systems: Exploiting Big Data for Systemic Risk Assessment, Bank of
England.
Parapar, J. and Barreiro, A. (2009) ‘Evaluation of text clustering algorithms with n-gram-based
document fingerprints’, Advances in Information Retrieval, Vol. 3, No. 12, pp.645–653.
Park, S., Lee, K. and Song, J. (2011) ‘Contrasting opposing views of news articles on contentious
issues’, in Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies – Volume 1, Association for Computational
Linguistics, Vol. 1, pp.340–349.

Popovici, R., Weiler, A. and Grossniklaus, M. (2014) ‘On-line clustering for real-time topic
detection in social media streaming data’, in SNOW 2014 Data Challenge, pp.57–63.
Qian, M. and Zhai, C. (2014) ‘Unsupervised feature selection for multi-view clustering on
text-image web news data’, in Proceedings of the 23rd ACM International Conference on
Conference on Information and Knowledge Management, pp.1963–1966, ACM, New York,
NY, USA.
Qiujun, L.A.N. (2010) ‘Extraction of news content for text mining based on edit distance’, Journal
of Computational Information Systems, Vol. 6, No. 11, pp.3761–3777.
Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B. and Grobelnik, M. (2016) ‘News across
languages-cross-lingual document similarity and event tracking’, Journal of Artificial
Intelligence Research, Vol. 55, No. 2, pp.283–316.
Saini, A. (2018) ‘An approach to data mining’, International Journal of Computer Science and
Mobile Applications, Vol. 6, No. 1, pp.31–37.
Santhiya, K. and Bhuvaneswari, V. (2018) ‘An automated MapReduce framework for crime
classification of news articles using MongoDB’, International Journal of Applied Engineering
Research, Vol. 13, No. 1, pp.131–136.
Shafiei, M., Wang, S., Zhang, R., Milios, E., Tang, B., Tougas, J. and Spiteri, R. (2006)
A Systematic Study of Document Representation and Dimension Reduction for Text
Clustering, Technical Report Technical Report CS-2006-05, Faculty of Computer Science.
Shah, N. and Mahajan, S. (2012) ‘Document clustering: a detailed review’, International Journal of
Applied Information Systems, Vol. 4, No. 5, pp.30–38.
Singh, K.N., Devi, H.M. and Mahanta, A.K. (2017) ‘Document representation techniques and their
effect on the document clustering and classification: a review’, International Journal of
Advanced Research in Computer Science, Vol. 8, No. 5, pp.1–12.
Sohangir, S. and Wang, D. (2017a) ‘Improved sqrt-cosine similarity measurement’, Journal of Big
Data, Vol. 4, No. 1, pp.25–38.
Sohangir, S. and Wang, D. (2017b) ‘Document understanding using improved sqrt-cosine
similarity’, in 2017 IEEE 11th International Conference on Semantic Computing (ICSC),
IEEE, pp.278–279.
Sonia, F.G. (2016) A Novel Committee-base Clustering Method, Master’s thesis, p.70,
Departamento de Informatica, Pontificia Universidade Catolica Do Rio De Janeiro.
Sowjanya, M.A. and Shashi, M. (2010) ‘Cluster feature-based incremental clustering approach
(CFICA) for numerical data’, International Journal of Computer Science and Network
Security, Vol. 10, No. 9, pp.73–79.
Svadasa, T. and Jhab, J. (2014) ‘A literature survey on text document clustering and ontology based
techniques’, International Journal of Innovative and Emerging Research in Engineering,
Vol. 1, No. 2, pp.8–11.
Tao, H., Hou, C., Liu, X., Yi, D. and Zhu, J. (2018) ‘Reliable multi-view clustering’, International
Journal of Advanced Research in Computer Science, Vol. 5, No. 3, pp.1–8.
Tao, Y., Christos, F., Dimitris, P. and Bin, L. (2004) ‘Prediction and indexing of moving objects
with unknown motion patterns’, in Proceedings of the 2004 ACM SIGMOD International
Conference on Management of Data, ACM Press, pp.611–622.
Toda, H. and Kataoka, R. (2005) ‘A clustering method for news articles retrieval system’,
in Special Interest Tracks and Posters of the 14th International Conference on World Wide
Web, ACM Press, Vol. 12, pp.988–989.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., Steinberg, D. et al. (2008)
‘Top 10 algorithms in data mining’, Knowledge and Information Systems, Vol. 14, No. 1,
pp.1–37.
Xia, Y., Tang, N., Hussain, A. and Cambria, E. (2015) ‘Discriminative bi-term topic model for
headline-based social news clustering’, in FLAIRS Conference, Vol. 12, pp.311–316.

