Musa Muhammad Liman completed his MSc in Computer Science from the
School of IT and Computing, American University of Nigeria, Yola, Adamawa
State, Nigeria. He received his BSc in Computer Science from the Al-Hikmah
University Ilorin, Kwara State, Nigeria. His research interests include data
mining, bioinformatics and data compression.
1 Introduction
The world we live in is full of data, and computers have become the accepted means of
storing it. Data can be saved easily and conveniently; anybody with access to a computer
can store it, and, more importantly, many users can share stored information or send it
to different locations (Kriegel et al., 2007). As the number of text documents stored in large databases
increases, this poses a huge challenge of understanding hidden patterns or relationships
inside the data. Because text data is not in numerical format, it can hardly be analysed directly
using statistical methods. Information overload or drowning in data is a common
complaint by people as they see the potential value of information, yet are frustrated in
their inability to derive benefit from it due to its volume and complexity (Sowjanya and
Shashi, 2010; Han et al., 2012).
Due to the rapid growth of online news articles, journals, books, research papers and web
pages every day, the need to quickly find the most important, interesting, valuable
or entertaining items has arisen. This is because we are overwhelmed by the increasing
volume of information made available online (Bouras and Tsogkas, 2016; Rupnik et al.,
2016). Humans throughout history have used information to achieve lots of great things
such as predicting the future to avoid disaster and to make some vital decisions (Butler
and Keselj, 2009; Jatowt and Au-Yeung, 2011; Biswas et al., 2014; Dhakar and Tiwari,
2014; Bouras and Tsogkas, 2016). Because this huge amount of information overloads the
internet and makes searching very tedious for users, techniques that can efficiently
and effectively derive profitable knowledge from
such diverse, unstructured information are in high demand (Bouras and Tsogkas, 2013;
Popovici et al., 2014; Lwin and Aye, 2017).
One of the most important means of dealing with data is classifying or grouping it into
clusters or categories. Classification has played an important and indispensable role
throughout human history (Wu et al., 2008; Brockmeier et al., 2018). There exist
two types of classification: supervised and unsupervised. Supervised
classification requires available predefined knowledge, whereas unsupervised
classification, sometimes referred to as clustering, needs no predefined labelled data
(Agrawal et al., 1998; Tao et al., 2004).
Grouping similar data such as news articles based on their characteristics is an
important issue. Grouping can be done on the basis of some similarity measure. Several
similarity measures (such as gauging, Jaccard, Euclidean, edit and cosine) have
been proposed and applied in computing the similarity between two different textual
documents based on character matching, word semantics and word sense (Damashek,
1995; Huang, 2008; Qiujun, 2010; Svadasa and Jhab, 2014; Jayashree et al., 2014;
Akinwale and Niewiadomski, 2015; Sonia, 2016; Huang et al., 2017). The rationale
behind every given method of measuring the similarity between two textual documents is
Clustering news articles using efficient similarity measure and N-grams 335
based on the increasing quest to improve the quality and the effectiveness of the existing
clustering or filtering techniques (Shah and Mahajan, 2012; Sonia, 2016; Singh et al.,
2017).
Bouras and Tsogkas (2016) proposed a clustering technique which uses a traditional
similarity measure with N-grams, and Sohangir and Wang (2017a) proposed an efficient
similarity measure known as ‘improved sqrt-cosine similarity measurement’ but did not
test its suitability with the N-gram data representation technique.
In this paper, we propose a technique for clustering news articles using an efficient
similarity measure known as ‘improved sqrt-cosine similarity measurement’ and
word-based N-grams. An N-gram is a contiguous sequence of characters or words within a
window taken from the body of text in a document; the window size is determined by the
number of grams selected. An experiment in the R programming environment has been
conducted to check the accuracy and purity of the proposed clustering technique on the
Reuters-21578 and 20Newsgroups datasets. The accuracy and purity results
of the proposed clustering technique for different values of N-grams are recorded, and the
best N-gram result is compared with the result of the baseline technique.
The rest of the paper is organised as follows. Section 2 presents related concepts.
Sections 3 and 4 describe the methodology, experimental results and the comparison
of results with the baseline clustering technique designed by Sohangir and Wang (2017a),
while Section 5 presents the conclusion and future work.
2 Related concepts
News articles clustering is a wide area of research that has been on for a very long time in
history which includes several tasks that range from segmenting events of news streams
to tracking and detecting events (Damashek, 1995; Kyle et al., 2012). Clustering
techniques or methods are proposed based on some form of documents presentation,
similarity and machine learning algorithms (Mihalcea and Tarau, 2005; Saini, 2018).
Research works on news article clustering techniques can be broadly categorised into
two: word-based clustering techniques and N-grams-based clustering techniques (Shafiei
et al., 2006; Qiujun, 2010; Ifrim et al., 2014; Rupnik et al., 2016). Formerly, the focus
was on clustering related news articles or documents, with the clusters serving as
the basis for extracting the information needed (Toda and Kataoka, 2005; Nyman et al.,
2018). The latter line of clustering technique design finds hidden features and clusters
them in order to identify events in news articles (Miao et al., 2005;
Shah and Mahajan, 2012; Mele and Crestani, 2017).
In the last few decades, research has focused on improving the efficiency of news
clustering techniques. The challenge is that traditional approaches, designed
with language dependencies and traditional similarity measures, cannot perform
efficiently as the number of news reports in different languages increases; many of these
reports are short, noisy and published at a very high speed (Tao et al.,
2018). One of the main problems in news article clustering is the ability to
perform efficiently on any kind of news article regardless of the language in
which the news articles or documents are presented.
Miao et al. (2005) proposed a document clustering method using character N-grams
and compared the results with term-based and word-based clusters; the technique applied character
336 D.B. Bisandu et al.
N-grams to build a feature document frequency (DF × IDF) scheme. The results of their
experiments show that using character N-grams gives the best clustering
result. Toda and Kataoka (2005) proposed a method for clustering news articles which
addressed the problem of retrieving information from information retrieval systems using
named entity extraction and terms, and finally labelling the classes of the document from
the term list. Their technique finds the maximum set of terms as features representing the
categories of the news within some specific time window. They identified the most
frequent term features and listed them within a window. Then, the terms are grouped and
an analysis technique is applied to determine the most frequent terms. The extraction of
these terms may result in a very large number of terms, especially if pre-processing
methods are not applied. Moreover, describing the detected terms in the news using a
single word set may not be intuitive and can be very difficult for humans to interpret.
In Newman et al. (2006), the authors present an approach to analysing entities and
topics from news articles using statistical topic models; this approach only considers how
topics can be generated from a news article, but does not consider categorising the news
articles into clusters or describing them. In Ikeda et al. (2006), a technique that can
automatically link blog entries with related news articles was proposed,
using the vector space model and cosine similarity to calculate the distance between the blog
and the news without knowing the category of the news article. However, this method
does not apply any clustering technique to determine how many news items belong to which
type of news, and it has been proven to be less efficient even though the authors tried to improve the
effectiveness of their technique using an intuitive weighting method (Naughton et al.,
2006).
Huang (2008) conducted an evaluation of clustering techniques based on text similarity
measures and confirmed that Euclidean distance performed worst in clustering,
thereby making any clustering technique that uses it less effective. Parapar and Barreiro
(2009) concluded from their experiments on various clustering algorithms that using
N-grams with any clustering algorithm helps to increase its effectiveness. They also
proposed an approach to reduce the computational load of existing
clustering algorithms by using a fingerprint method to trim the size of the documents
before applying the clustering algorithm; their approach performs very well with
respect to saving memory and computation time. Karol and Mangat (2013)
used particle swarm optimisation to evaluate their proposed clustering technique, which was
designed using the cosine similarity measure. They affirmed that the method of document
representation also affects the quality of a clustering technique. A similar problem was
addressed by Bouras and Tsogkas (2010), who investigated the application of a wide
spectrum of clustering algorithms, as well as the distance measures involved, by
designing a news article clustering technique that makes use of three different similarity
measures. Their experimental results show that, despite the simplicity of the k-means
algorithm, applying the right pre-processing methods increases the efficiency
of the clustering technique. Analogously, Qiujun (2010) proposed an approach for
extracting news content which is based on twin pages with the same features
(specifically noisy similarity). The similarity measure applied is based on edit distance,
chosen for its simplicity despite its fairly high complexity. This technique was
designed to check the appropriateness of applying text-cleaning techniques to
unstructured data from different web pages before clustering.
Park et al. (2011) proposed a news article clustering technique for contrasting
contentious issues in news articles from opposing sides, based on word features, by
using the disputant relations similarity with the issues at hand. This technique used
word-based representation with the HITS algorithm to calculate the similarities between
different discourses. Li et al. (2011) proposed a two-stage scalable personalised news
recommendation clustering technique based on intrinsic user interest; however,
this technique uses hierarchy and topic detection, with cosine similarity to calculate
the distances between the interests of users before categorising. Bouras and Tsogkas
(2012) proposed a news article clustering technique using keywords and WordNet,
applying cosine similarity to calculate the distances between the keywords and finally
clustering using weighted k-means. Subsequently, Bouras and Tsogkas
(2013) proposed a method based on word-based N-gram techniques and the ‘bag of words’
model, with WordNet used to enrich the N-gram word list, clustering with the k-means core
processes and the Wk-means extension of k-means. This method was implemented as an
N-gram-based clustering system without giving consideration to improving the similarity
measure. The performance of the technique was measured using the clustering index (CI)
with k-means and a previously proposed Wk-means algorithm. Though the research
validated the improvement from N-gram-based data representation, it
failed to check whether improving the similarity measure used on the news articles could
also improve the coherence of the news article clusters. Qian and Zhai (2014) proposed
a multi-view clustering technique for selecting features in an unsupervised way for
text-image web news data, where images are learned from a local orthogonal non-negative
matrix factorisation for labelling. However, this technique was designed on the basis of
views of a particular image. Analogously, Xia et al. (2015) proposed a clustering
technique for social news using a topic model known as discriminative bi-term, which
excludes less indicative bi-terms by discriminating topical terms from general
and specific documents. This technique, however, is language dependent because of the
discrimination attached to the specific document, which makes it inflexible.
Other recent and popular techniques for clustering news articles and textual
documents include the following. Bouras and Tsogkas (2016) designed a document clustering
system that helps to solve the new user problem based on the WordNet database and minimal
user ratings. The system is implemented using word-based N-grams, fetching articles from the
database and making recommendations to new users. The results of the experiments show
that changing the value of ‘n’ has a great impact on the clustering.
Rupnik et al. (2016) designed a method that can track events written in different
languages and can also compare articles across languages to make
predictions of events. It was implemented using document similarity measures
based on cross-language Wikipedia, on a
multi-language system with semantic-based feature selection using a probabilistic cosine
similarity measure.
Lwin and Aye (2017) proposed a method for document clustering using
hierarchical clustering based on the number of occurrences of word representations in the
dataset rather than on the frequency of the items; the Jaccard similarity measure was used
for calculating the similarity between the documents. Sohangir and Wang (2017a, 2017b)
proposed a similarity measure based on the Hellinger distance, known as ‘improved
sqrt-cosine similarity measurement’. This similarity measure was tested on different
datasets and compared with other existing similarity measures for clustering textual
documents containing high-dimensional data, and was proven to be more robust
in contributing to the quality of the clusters. This measure was tested only on the ‘bag of
words’ representation, but not on sequence-of-character or sequence-of-word representations
of documents such as N-grams.
Santhiya and Bhuvaneswari (2018) designed a clustering technique and implemented
it on a system applying the MapReduce framework for classification of crime in news
articles using MongoDB. However, the authors indicated that there is a need to design a
technique that can automatically categorise crime from different sources irrespective of
the language used to present the news.
3 Methodology
The proposed clustering technique consists of the following steps: news article
pre-processing, news article representation using N-grams, a vector space model of the
news articles, dimensionality reduction using a threshold on the feature vector and the
improved sqrt-cosine similarity measurement. Finally, k-means clustering is applied to
the obtained vectors and clusters of news articles are obtained. At the end, these clusters
of news articles are evaluated with a view to discovering knowledge.
sequence, text semantics is captured better. Thus, we consider word-based N-grams;
N-grams are collections of adjacent words, from which bi-grams, tri-grams, etc. are
obtained. For the N-gram method of representing the news articles, the removal of stop
words is not needed, and other pre-processing such as stemming is also not needed.
Thus, the use of N-grams helps in ignoring any grammatical or typographical errors in
the articles. For instance, given the articles {“A1: I am here, A2: I won’t, A3: I am a boy”},
the character 4-gram representation will be {A1: I_am, _am_, am_h, m_he, _her, here,
A2: I_wo, _won, won’t, A3: I_am, _am_, am_a, m_a_, _a_b, a_bo, _boy}.
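The character 4-gram example above can be reproduced with a short sketch (our own illustrative Python, not the authors' code; the paper's experiments were run in R, and the helper names are ours). The word-based variant is the representation the paper adopts:

```python
# Hypothetical helpers for generating N-grams from a text.
def char_ngrams(text, n):
    """Character N-grams over a text, with spaces marked as '_'."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def word_ngrams(text, n):
    """Word-based N-grams: sequences of n adjacent words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("I am here", 4))   # ['I_am', '_am_', 'am_h', 'm_he', '_her', 'here']
print(word_ngrams("I am a boy", 2))  # ['I am', 'am a', 'a boy']
```

The first call reproduces the A1 example from the text; setting n to 2, 3 or 4 in `word_ngrams` yields the bi-, tri- and quad-grams evaluated in the experiments.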
where mi,j represents the number of times the relevant term sequence ti appears in the
document dj, and the denominator is the sum of occurrences of terms in the whole document dj.
We used the weighting formula in equation (2) because of its simplicity and its
more accurate results (Sohangir and Wang, 2017a).
w_{t,d} = \begin{cases} 1 + \log_{10} tf_{t,d} & \text{if } tf_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}    (2)
where wt,d is the normalised log frequency weight and tft,d is the TF of the sequence in the
document dj.
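Equation (2) can be sketched directly (our own illustrative Python; the function name is ours):

```python
import math

def log_tf_weight(tf):
    """Normalised log frequency weight of equation (2):
    w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

print(log_tf_weight(0))    # 0   (term absent)
print(log_tf_weight(1))    # 1.0 (term appears once)
print(log_tf_weight(100))  # 3.0 (frequency damped logarithmically)
```

The log damping means a term occurring 100 times weighs only three times as much as a term occurring once, which keeps very frequent N-grams from dominating the vector space model.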
During news article clustering, only the dimensions of the feature vector are reduced,
which means that only the number of features to be used for the clustering is reduced. A
threshold is applied to the wt,d values of the vector space model in order to select these
features: the N-grams with the highest total wt,d weight in the news article text
collection are selected as features for clustering the news article collection.
In this way, 50% of the dimensions are successfully reduced.
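The threshold selection described above can be sketched as follows (an illustrative assumption of the procedure; the paper gives no code, and `select_features` and the sample weights are our own):

```python
# Keep the fraction of N-gram features with the highest total w_{t,d}
# weight summed over the whole collection (here 50%, as in the text).
def select_features(total_weights, keep_fraction=0.5):
    """total_weights: dict mapping N-gram -> summed w_{t,d} over all articles."""
    ranked = sorted(total_weights, key=total_weights.get, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:k])

# Hypothetical bi-gram weights for illustration only.
weights = {"stock market": 9.1, "said on": 2.3, "oil price": 7.4, "of the": 1.1}
print(select_features(weights))  # the two highest-weighted bi-grams survive
```

Only the surviving features are carried into the similarity computation, halving the dimensionality of the vectors that k-means must process.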
ISC(p, q) = \frac{\sum_{i=1}^{n} \sqrt{p_i q_i}}{\sqrt{\sum_{i=1}^{n} p_i} \, \sqrt{\sum_{i=1}^{n} q_i}}    (3)
where each document is normalised and the square root of its normalised form, that is,
\sqrt{\sum_{i=1}^{n} p_i}, is used; pi is document one in normalised form, qi is document two in
normalised form and ISC(p, q) is their similarity measure. Equation (3) is known as the
efficient similarity measure. It has been chosen because it has been proven to be an effective
measure compared with other ‘state of the art’ similarity measures for textual document
clustering on word-based document representation, but it has not been tested with
N-gram-based document representation (Sohangir and Wang, 2017a).
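Equation (3) translates into a few lines of code. A minimal sketch (our own Python, not Sohangir and Wang's implementation), for non-negative term-weight vectors of equal length:

```python
import math

def isc(p, q):
    """Improved sqrt-cosine similarity of equation (3) for
    non-negative term-weight vectors p and q of equal length."""
    num = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    den = math.sqrt(sum(p)) * math.sqrt(sum(q))
    return num / den if den else 0.0

print(isc([1, 4, 0], [1, 4, 0]))  # 1.0 (identical vectors)
print(isc([1, 0], [0, 1]))        # 0.0 (no shared terms)
```

Unlike cosine similarity, which normalises by Euclidean (L2) norms, this measure takes square roots of the element products and normalises by the square roots of the L1 sums, which is the Hellinger-distance-based form the text describes.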
the k value at which the within sum of squares (WSS) is smallest, as generated from our
experiment with the elbow method illustrated in Figure 2. We use the
estimated WSS because we want to adhere to the main goal of clustering, which is
reducing the WSS distance of the clusters. Figure 2 shows the plot of the within sum of
squares on the two datasets using the elbow method; it is clear from the plot that the best
value for the number of clusters is three, because both datasets form the first elbow at
k = 3 (Lebret and Collobert, 2014; Singh et al., 2017).
Figure 2 Graph between within sum of squares and different values of k to determine the best
number of clusters
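The WSS underlying the elbow method can be sketched as follows (our own illustrative Python; the paper's experiments used R, where `kmeans()` reports this quantity as `tot.withinss`):

```python
# WSS: summed squared distance of each point to its cluster centroid.
# Plotting WSS against k and picking the first "elbow" gives the chosen k.
def wss(points, labels):
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(sum((m[d] - centroid[d]) ** 2 for d in range(dim))
                     for m in members)
    return total

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
print(wss(pts, [0, 0, 1]))  # 2.0: the first two points are each 1 unit from centroid (0, 1)
```

Adding clusters always lowers WSS, so the elbow (where the decrease flattens, here at k = 3 for both datasets) rather than the minimum is taken as the best k.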
4 Experimental results
F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (4)
where precision is the measure of how specific an object is clustered with respect to the
original class and can be calculated by equation (5).
\text{Precision} = \frac{\text{Number of common pairs in the original class and clustered result } (tp)}{\text{Number of pairs in the clustering result } (fp + tp)}    (5)
Recall is the measure of how well an object can be retrieved from the clustered result and
can be calculated by equation (6).
\text{Recall} = \frac{\text{Number of common pairs in the original class and clustered result } (tp)}{\text{Number of common pairs in the original class } (fn + tp)}    (6)
Purity is the measure of the quality of a single cluster Cj with respect to the original
class pij. A higher purity value means a better clustering technique. Purity can be
calculated by equation (7).
\text{Purity} = \frac{1}{|C_j|} \max_j \{p_{ij}\}    (7)
where Cj represents a single cluster in the generated clusters from the original classes of
documents and max_j {p_ij} is the largest number of objects common to both the original
class and the single cluster under consideration.
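Equations (4) to (7) reduce to simple ratios over pair and object counts. A sketch with our own function names (the tp/fp/fn counts are over pairs of objects, as defined in the text):

```python
# Illustrative implementations of the evaluation metrics.
def precision(tp, fp):
    return tp / (tp + fp)              # equation (5)

def recall(tp, fn):
    return tp / (tp + fn)              # equation (6)

def f_measure(p, r):
    return 2 * p * r / (p + r)         # equation (4)

def purity(cluster_size, max_overlap):
    """Equation (7): (1/|Cj|) * max_j {p_ij} for a single cluster Cj."""
    return max_overlap / cluster_size

p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)
print(p, r)                  # 0.8 0.5
print(f_measure(p, r))       # harmonic-mean combination of the two
print(purity(10, 9))         # 0.9: 9 of 10 cluster members share the majority class
```

F-measure is the harmonic mean of precision and recall, so a technique must score well on both to score well overall; purity is computed per cluster and averaged over clusters to rate the whole solution.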
that the performance of the proposed technique can be checked with respect to the
baseline technique results.
Figure 3 Line chart showing accuracy of proposed technique on different N-grams (see online
version for colours)
According to the results of the average performance across the two datasets in Table 2,
the number of grams used in generating the clusters affects the effectiveness of the
clustering technique. Clearly, tri-grams outperform the other N-grams on both datasets,
while quad-grams have the poorest results, which shows that the number of grams used to
generate cluster solutions should be carefully selected. Figure 3 and Figure 4 show the
plots of accuracy and purity of the proposed clustering technique for different values of
N-grams on all datasets, respectively.
Figure 4 Line chart showing purity of proposed technique on different N-grams (see online
version for colours)
5 Conclusion and future work
Finding an efficient and effective technique to cluster textual documents such as news
articles is a critical and challenging problem in information retrieval. Most clustering
techniques are designed based on word frequency with similarity measures such as
cosine, which is based on the Euclidean measure. This has been useful in many
applications; however, word-based clustering is not ideal, because of its language
dependency and because it does not work well with high-dimensional multi-lingual data.
In this paper, we proposed a new clustering technique based on N-grams. At different
windows of the N-grams, from one-gram to quad-grams, comprehensive experiments were
conducted to check the effect of changing the N-gram window on the proposed clustering
technique. We compared the performance of all the grams from one-gram to
quad-grams in order to determine which has the best results on all the datasets in various
document understanding tasks. The experiments show that, although our
proposed clustering technique used the same similarity measure as the baseline technique,
the N-grams have a greater impact on the final accuracy and purity of the generated
clusters, which helps our proposed clustering technique to perform favourably compared
to other clustering techniques on high-dimensional data.
The following points can be considered in future work:
1 Different clustering algorithms can be applied to check their performance with the
proposed technique.
2 Each category of article clusters can be further classified into predefined labelled
classes and sub-classes.
References
Aggarwal, C.C. (Ed.) (2012) Mining Text Data, Springer, New York, NY.
Agrawal, R., Dimitrios, G. and Frank, L. (1998) ‘Mining process models from workflow logs’, in
International Conference on Extending Database Technology, pp.467–483, Springer, Berlin,
Heidelberg.
Akinwale, A. and Niewiadomski, A. (2015) ‘Efficient similarity measures for texts matching’,
Journal of Applied Computer Science, Vol. 23, No. 1, pp.7–28.
Biswas, S.K., Sinha, N., Baruah, B. and Purkayastha, B. (2014) ‘Intelligent decision support system
of swine flu prediction using novel case classification algorithm’, International Journal of
Knowledge Engineering and Data Mining, Vol. 3, No. 1, pp.1–19.
Bouras, C. and Tsogkas, V. (2010) ‘Assigning web news to clusters’, in 2010 Fifth International
Conference on Internet and Web Applications and Services (ICIW), IEEE, Vol. 12, pp.1–6.
Bouras, C. and Tsogkas, V. (2012) ‘A clustering technique for news articles using WordNet’,
Knowledge-Based Systems, Vol. 36, No. 2, pp.115–128.
Bouras, C. and Tsogkas, V. (2013) ‘Enhancing news articles clustering using word n-grams’,
in DATA, pp.53–60.
Bouras, C. and Tsogkas, V. (2016) ‘Assisting cluster coherency via n-grams and clustering as a tool
to deal with the new user problem’, International Journal of Machine Learning and
Cybernetics, Vol. 7, No. 2, pp.171–184.
Brockmeier, A.J., Mu, T., Ananiadou, S. and Goulermas, J.Y. (2018) ‘Self-tuned descriptive
document clustering using a predictive network’, IEEE Transactions on Knowledge and Data
Engineering, Vol. 12, No. 2, pp.1–14.
Butler, M. and Keselj, V. (2009) ‘Financial forecasting using character n-gram analysis and
readability scores of annual reports’, in Canadian Conference on AI, Springer, pp.39–51.
Damashek, M. (1995) ‘Gauging similarity with n-grams: language-independent categorization of
text’, Science, New Series, Vol. 267, No. 5199, pp.843–848.
Dhakar, M. and Tiwari, A. (2014) ‘Tree-augmented naïve Bayes-based model for intrusion
detection system’, International Journal of Knowledge Engineering and Data Mining, Vol. 3,
No. 1, pp.20–30.
Han, J., Kamber, M. and Pei, J. (2012) Data Mining: Concepts and Techniques, 3rd ed., Morgan
Kaufmann, Elsevier, Amsterdam.
Huang, A. (2008) ‘Similarity measures for text document clustering’, in Proceedings of the Sixth
New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch,
New Zealand, pp.49–56.
Huang, Z., Yi-Fei, W. and Mei, S. (2017) ‘Visual word based similar image retrieval optimization
by hamming distance’, Sciencepaper Online, Vol. 12, No. 2, pp.1–12.
Ifrim, G., Shi, B. and Brigadir, I. (2014) ‘Event detection in twitter using aggressive filtering and
hierarchical tweet clustering’, in Second Workshop on Social News on the Web (SNOW), ACM
Press, Seoul, Korea, 8 April, pp.1–34.
Ikeda, D., Fujiki, T. and Okumura, M. (2006) ‘Automatically linking news articles to blog entries’,
in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, ACM Press,
Vol. 23, pp.78–82.
Jatowt, A. and Au-Yeung, C. (2011) ‘Extracting collective expectations about the future from large
text collections’, in Proceedings of the 20th ACM International Conference on Information
and Knowledge Management, ACM Press, pp.1259–1264.
Jayashree, R., Murthy, K.S. and Anami, B.S. (2014) ‘Hybrid methodologies for summarisation of
Kannada language text documents’, International Journal of Knowledge Engineering and
Data Mining, Vol. 3, No. 1, pp.82–114.
Karol, S. and Mangat, V. (2013) ‘Evaluation of text document clustering approach based on
particle swarm optimization’, Open Computer Science, Vol. 3, No. 2, pp.69–90.
Kriegel, H-P., Borgwardt, K.M., Kröger, P., Pryakhin, A., Schubert, M. and Zimek, A. (2007)
‘Future trends in data mining’, Data Mining and Knowledge Discovery, Vol. 15, No. 1, pp.87–97.
Kyle, A., Obizhaeva, A., Sinha, N. and Tuzun, T. (2012) ‘News articles and the invariance
hypothesis’, CEFRN, Vol. 34, No. 3, pp.1–44.
Lebret, R. and Collobert, R. (2014) N-gram-based Low-dimensional Representation for Document
Classification, Vol. 10, No. 12, pp.1–8, ArXiv preprint ArXiv, 1412.6277.
Li, L., Wang, D., Li, T., Knox, D. and Padmanabhan, B. (2011) ‘SCENE: a scalable two-stage
personalized news recommendation system’, in Proceedings of the 34th International ACM
SIGIR Conference on Research and Development in Information Retrieval, ACM Press,
Vol. 23, pp.125–134.
Lwin, M.T. and Aye, M.M. (2017) ‘A modified hierarchical agglomerative approach for efficient
document clustering system’, American Scientific Research Journal for Engineering,
Technology, and Sciences (ASRJETS), Vol. 29, No. 1, pp.228–238.
Mamaysky, H. and Glasserman, P. (2017) Does Unusual News Forecast Market Stress, Working
Papers 16-04, Office of Financial Research, US Department of the Treasury.
Mele, I. and Crestani, F. (2017) ‘Event detection for heterogeneous news streams’, in International
Conference on Applications of Natural Language to Information Systems, Springer, Vol. 34,
pp.110–123.
Miao, Y., Kešelj, V. and Milios, E. (2005) ‘Document clustering using character n-grams:
a comparative evaluation with term-based and word-based clustering’, in Proceedings of the
14th ACM International Conference on Information and Knowledge Management, ACM
Press, pp.357–358.
Mihalcea, R. and Tarau, P. (2005) ‘A language independent algorithm for single and multiple
document summarization’, in Proceedings of IJCNLP, ACM Press, Vol. 5, pp.12–20.
Naughton, M., Kushmerick, N. and Carthy, J. (2006) ‘Clustering sentences for discovering events
in news articles’, in European Conference on Information Retrieval, Springer, Vol. 22,
pp.535–538.
Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M. (2006) ‘Analyzing entities and topics
in news articles using statistical topic models’, in International Conference on Intelligence
and Security Informatics, Springer, Vol. 24, pp.93–104.
Nyman, R., Kapadia, S., Tuckett, D., Gregory, D., Ormerod, P. and Smith, R. (2018) News and
Narratives in Financial Systems: Exploiting Big Data for Systemic Risk Assessment, Bank of
England.
Parapar, J. and Barreiro, A. (2009) ‘Evaluation of text clustering algorithms with n-gram-based
document fingerprints’, Advances in Information Retrieval, Vol. 3, No. 12, pp.645–653.
Park, S., Lee, K. and Song, J. (2011) ‘Contrasting opposing views of news articles on contentious
issues’, in Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies – Volume 1, Association for Computational
Linguistics, Vol. 1, pp.340–349.
Popovici, R., Weiler, A. and Grossniklaus, M. (2014) ‘On-line clustering for real-time topic
detection in social media streaming data’, in SNOW 2014 Data Challenge, pp.57–63.
Qian, M. and Zhai, C. (2014) ‘Unsupervised feature selection for multi-view clustering on
text-image web news data’, in Proceedings of the 23rd ACM International Conference on
Conference on Information and Knowledge Management, pp.1963–1966, ACM, New York,
NY, USA.
Qiujun, L.A.N. (2010) ‘Extraction of news content for text mining based on edit distance’, Journal
of Computational Information Systems, Vol. 6, No. 11, pp.3761–3777.
Rupnik, J., Muhic, A., Leban, G., Skraba, P., Fortuna, B. and Grobelnik, M. (2016) ‘News across
languages-cross-lingual document similarity and event tracking’, Journal of Artificial
Intelligence Research, Vol. 55, No. 2, pp.283–316.
Saini, A. (2018) ‘An approach to data mining’, International Journal of Computer Science and
Mobile Applications, Vol. 6, No. 1, pp.31–37.
Santhiya, K. and Bhuvaneswari, V. (2018) ‘An automated MapReduce framework for crime
classification of news articles using MongoDB’, International Journal of Applied Engineering
Research, Vol. 13, No. 1, pp.131–136.
Shafiei, M., Wang, S., Zhang, R., Milios, E., Tang, B., Tougas, J. and Spiteri, R. (2006)
A Systematic Study of Document Representation and Dimension Reduction for Text
Clustering, Technical Report Technical Report CS-2006-05, Faculty of Computer Science.
Shah, N. and Mahajan, S. (2012) ‘Document clustering: a detailed review’, International Journal of
Applied Information Systems, Vol. 4, No. 5, pp.30–38.
Singh, K.N., Devi, H.M. and Mahanta, A.K. (2017) ‘Document representation techniques and their
effect on the document clustering and classification: a review’, International Journal of
Advanced Research in Computer Science, Vol. 8, No. 5, pp.1–12.
Sohangir, S. and Wang, D. (2017a) ‘Improved sqrt-cosine similarity measurement’, Journal of Big
Data, Vol. 4, No. 1, pp.25–38.
Sohangir, S. and Wang, D. (2017b) ‘Document understanding using improved sqrt-cosine
similarity’, in 2017 IEEE 11th International Conference on Semantic Computing (ICSC),
IEEE, pp.278–279.
Sonia, F.G. (2016) A Novel Committee-based Clustering Method, Master’s thesis, p.70,
Departamento de Informatica, Pontificia Universidade Catolica Do Rio De Janeiro.
Sowjanya, M.A. and Shashi, M. (2010) ‘Cluster feature-based incremental clustering approach
(CFICA) for numerical data’, International Journal of Computer Science and Network
Security, Vol. 10, No. 9, pp.73–79.
Svadasa, T. and Jhab, J. (2014) ‘A literature survey on text document clustering and ontology based
techniques’, International Journal of Innovative and Emerging Research in Engineering,
Vol. 1, No. 2, pp.8–11.
Tao, H., Hou, C., Liu, X., Yi, D. and Zhu, J. (2018) ‘Reliable multi-view clustering’, International
Journal of Advanced Research in Computer Science, Vol. 5, No. 3, pp.1–8.
Tao, Y., Christos, F., Dimitris, P. and Bin, L. (2004) ‘Prediction and indexing of moving objects
with unknown motion patterns’, in Proceedings of the 2004 ACM SIGMOD International
Conference on Management of Data, ACM Press, pp.611–622.
Toda, H. and Kataoka, R. (2005) ‘A clustering method for news articles retrieval system’,
in Special Interest Tracks and Posters of the 14th International Conference on World Wide
Web, ACM Press, Vol. 12, pp.988–989.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., Steinberg, D. et al. (2008)
‘Top 10 algorithms in data mining’, Knowledge and Information Systems, Vol. 14, No. 1,
pp.1–37.
Xia, Y., Tang, N., Hussain, A. and Cambria, E. (2015) ‘Discriminative bi-term topic model for
headline-based social news clustering’, in FLAIRS Conference, Vol. 12, pp.311–316.