International Journal of Computer Engineering and Technology (IJCET)

ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 6, November - December (2013), pp. 355-366, IAEME


Shailendra Kumar Singh1, Gunjan Verma2, Devesh Som3, Mohd. Shamsul Haq4

1, 3 Meerut Institute of Engineering & Technology, Meerut, Computer Science Department
2, 4 Meerut Institute of Engineering & Technology, Meerut, Master of Computer Applications

ABSTRACT

Due to the abundance of unstructured data available today, finding an automated way to retrieve information, or to respond to a structured or an unstructured query, is an interesting research problem. Domain-specific information is useful in natural language processing, information retrieval and systems reasoning. This paper presents a document retrieval approach based on a combination of Latent Semantic Indexing (LSI) and a clustering algorithm. The idea is to first retrieve papers and create initial clusters based on LSI. Then, we use a flat clustering method to further group similar documents into clusters. This paper also presents the clustering algorithm used for this purpose, the ROCK algorithm, and we try to show that it gives better results. The main advantage of our method is that it forces the centroid vector towards the extremities, and consequently gets a completely different starting point compared to the standard algorithm.

Keywords: Latent Semantic Index (LSI), Clustering Algorithm, ROCK Algorithm, SVD.

1. INTRODUCTION

1.1 Semantic Search

Semantic Search attempts to augment and improve traditional search results (based on Information Retrieval technology) by using data from the Semantic Web.

1.2 Ontologies in Semantic Web

An ontology is a data model that can be used to describe a set of concepts and the relationships between those concepts within a domain. Ontologies are the main component of knowledge representation for the Semantic Web. Research groups in both America and Europe developed ontology modeling languages such as the DARPA Agent Markup Language (DAML) and the Ontology Inference Layer (OIL).

1.3. Clustering

A large dataset can be divided into a few smaller ones, each containing data that are close in some sense. This procedure is called clustering, and it is a common operation in data mining. In information and text retrieval, clustering is useful for the organization and search of large text collections. Clustering interfaces employ a fairly new class of algorithms called post-retrieval document clustering algorithms.

1.4. Goodness Measure

A criterion function can be used to estimate the goodness of clusters: the best clustering of points is the one that results in the highest value of the criterion function. Since our goal is to find a clustering that maximizes the criterion function, we use a measure similar to the criterion function in order to determine the best pair of clusters to merge at each step of ROCK's hierarchical clustering algorithm. For a pair of clusters $C_i$, $C_j$, let $link[C_i, C_j]$ store the number of cross links between clusters $C_i$ and $C_j$, that is, $\sum_{p_q \in C_i,\, p_r \in C_j} link(p_q, p_r)$. Then, we define the goodness measure $g(C_i, C_j)$ for merging clusters $C_i$, $C_j$ as follows.
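For reference, the goodness measure of the original ROCK paper (Guha, Rastogi and Shim) can be written as below. Here $n_i$ and $n_j$ denote the number of points in $C_i$ and $C_j$, and $f(\theta)$ is a function of the neighbour threshold $\theta$. That the implementation described in this paper uses exactly this form is an assumption.

```latex
g(C_i, C_j) =
  \frac{link[C_i, C_j]}
       {(n_i + n_j)^{1 + 2f(\theta)} - n_i^{1 + 2f(\theta)} - n_j^{1 + 2f(\theta)}}
```

The denominator is the expected number of cross links between the two clusters: the expected links inside the merged cluster minus the expected links inside each cluster on its own.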

Fig 1.4 Goodness Measure Formula

Using only the number of cross links as the merging criterion is a naive approach that may work well for well-separated clusters, but otherwise it lets large clusters absorb other clusters simply because they have more cross links. In order to remedy this problem, we divide the number of cross links between clusters by the expected number of cross links between them. Thus, if every point in $C_i$ has $n_i^{f(\theta)}$ neighbours, where $n_i$ is the number of points in $C_i$ and $\theta$ is the similarity threshold that decides whether two points are neighbours, then the expected number of links involving only points in the cluster is approximately $n_i^{1+2f(\theta)}$.

2. CLUSTERING ALGORITHM

ROCK's hierarchical clustering algorithm, presented below, belongs to the class of agglomerative hierarchical clustering algorithms. It accepts as input the set S of n sampled points to be clustered and the number of desired clusters k. The procedure begins by computing the number of links between pairs of points in Step 1.

procedure cluster(S, k)
begin
 1. link := compute_links(S)
 2. for each s ∈ S do
 3.     q[s] := build_local_heap(link, s)
 4. Q := build_global_heap(S, q)
 5. while size(Q) > k do {
 6.     u := extract_max(Q)
 7.     v := max(q[u])
 8.     delete(Q, v)
 9.     w := merge(u, v)
10.     for each x ∈ q[u] ∪ q[v] do {
11.         link[x, w] := link[x, u] + link[x, v]
12.         delete(q[x], u); delete(q[x], v)
13.         insert(q[x], w, g(x, w)); insert(q[w], x, g(x, w))
14.         update(Q, x, q[x])
15.     }
16.     insert(Q, w, q[w])
17.     deallocate(q[u]); deallocate(q[v])
18. }
end

Initially, each point is a separate cluster. For each cluster i, we build a local heap q[i] and maintain the heap during the execution of the algorithm. q[i] contains every cluster j such that link[i, j] is nonzero. The clusters j in q[i] are ordered in decreasing order of the goodness measure with respect to i, g(i, j). In addition to the local heaps q[i] for each cluster i, the algorithm also maintains a global heap Q that contains all the clusters. The clusters in Q are ordered in decreasing order of their best goodness measures. Thus, g(j, max(q[j])) is used to order the various clusters j in Q, where max(q[j]), the max element in q[j], is the best cluster to merge with cluster j. At each step, the max cluster j in Q and the max cluster in q[j] are the best pair of clusters to be merged.

The while-loop in Step 5 iterates until only k clusters remain in the global heap Q. In addition, it also stops clustering if the number of links between every pair of the remaining clusters becomes zero. In each step of the while-loop, the max cluster u is extracted from Q by extract_max, and q[u] is used to determine the best cluster v for it. Since clusters u and v will be merged, entries for u and v are no longer required and can be deleted from Q. Clusters u and v are then merged in Step 9 to create a cluster w containing |u| + |v| points. There are two tasks that need to be carried out once clusters u and v are merged: (1) for every cluster that contains u or v in its local heap, the elements u and v need to be replaced with the new merged cluster w and the local heap needs to be updated, and (2) a new local heap for w needs to be created. Both these tasks are carried out in the for-loop of Steps 10-15.
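The merge loop can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the local and global heaps are replaced by a plain scan over all linked pairs, the names `f`, `goodness` and `rock_cluster` are hypothetical, and the choice f(θ) = (1 - θ) / (1 + θ) is the one suggested in the original ROCK paper and is assumed here.

```python
def f(theta):
    # Assumed form of f(theta), as suggested in the original ROCK paper.
    return (1.0 - theta) / (1.0 + theta)


def goodness(cross_links, n_i, n_j, theta):
    """Cross links divided by the expected number of cross links."""
    e = 1.0 + 2.0 * f(theta)
    return cross_links / ((n_i + n_j) ** e - n_i ** e - n_j ** e)


def rock_cluster(link, k, theta):
    """Greedily merge the best pair until k clusters remain or no links are left."""
    n = len(link)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member points
    # cross[(a, b)] = number of cross links between clusters a < b (nonzero only)
    cross = {(i, j): link[i][j]
             for i in range(n) for j in range(i + 1, n) if link[i][j] > 0}
    next_id = n
    while len(clusters) > k and cross:
        def score(pair):
            a, b = pair
            return goodness(cross[pair], len(clusters[a]), len(clusters[b]), theta)

        u, v = max(cross, key=score)          # Steps 6-7, without heaps
        w, next_id = next_id, next_id + 1
        clusters[w] = clusters.pop(u) + clusters.pop(v)   # Step 9
        merged = {}
        for (a, b), count in cross.items():
            # Rename u and v to w; the pair (u, v) itself collapses and is dropped.
            ends = {w if c in (u, v) else c for c in (a, b)}
            if len(ends) == 2:
                key = tuple(sorted(ends))
                # Step 11: link[x, w] = link[x, u] + link[x, v]
                merged[key] = merged.get(key, 0) + count
        cross = merged
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical link counts for four points.
toy_link = [[0, 3, 1, 0],
            [3, 0, 0, 0],
            [1, 0, 0, 2],
            [0, 0, 2, 0]]
result = rock_cluster(toy_link, k=2, theta=0.4)
```

With θ = 0.4 (an assumed value, not stated in the text), `goodness(1, 1, 1, 0.4)` evaluates to approximately 0.616, which agrees with the nonzero goodness values listed in Section 6. The heaps of the pseudocode serve only to find the best pair faster; the sequence of merges is the same up to tie-breaking.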
The number of links between x and w is simply the sum of the number of links between x and u, and between x and v. This is used to compute g(x, w), the new goodness measure for the pair of clusters x and w, and the two clusters are inserted into each other's local heaps.

2.1. Computation of Links

One way of viewing the problem of computing links between every pair of points is to consider an n x n adjacency matrix A in which entry A[i, j] is 1 or 0 depending on whether or not points i and j, respectively, are neighbours. The number of links between a pair of points i and j can be obtained by multiplying row i with column j (that is, $\sum_{l=1}^{n} A[i, l] \cdot A[l, j]$). Thus, the problem of computing the number of links for all pairs of points is simply that of multiplying the adjacency matrix A with itself. The time complexity of the naive algorithm to compute the square of a matrix is $O(n^3)$. However, calculating the square of a matrix is a well-studied problem, and well-known algorithms such as Strassen's algorithm run in time $O(N^{2.81})$. The best complexity currently possible is $O(N^{2.37})$, due to the algorithm by Coppersmith and Winograd.
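The two views of link computation can be compared in a short Python sketch: squaring the adjacency matrix, and enumerating all pairs of neighbours of each point, which is cheaper when A is sparse. The function names and the toy matrix are hypothetical. Unlike a version that visits each unordered pair once, this sketch counts ordered pairs, including a point paired with itself, so that its output matches A multiplied by A entry for entry.

```python
def links_by_matrix_square(A):
    """link[i][j] = sum over l of A[i][l] * A[l][j], i.e. the matrix product A * A."""
    n = len(A)
    return [[sum(A[i][l] * A[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def links_by_neighbour_lists(A):
    """For every point, add one link to each pair of its neighbours."""
    n = len(A)
    link = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in range(n) if A[i][j]]
        for j in neighbours:
            for l in neighbours:
                link[j][l] += 1   # i is a common neighbour of j and l
    return link


# Hypothetical symmetric adjacency matrix for four points.
toy_A = [[1, 1, 0, 0],
         [1, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 1]]
dense = links_by_matrix_square(toy_A)
sparse = links_by_neighbour_lists(toy_A)
```

Both functions return the same matrix for a symmetric A; the second does work proportional to the sum of the squared neighbour counts rather than to n cubed.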


Fig 2.1.1. Computing Links

We expect that, on average, the number of neighbours for each point will be small compared to the number of input points n, causing the adjacency matrix A to be sparse. For such sparse matrices, the algorithm in Figure 2.1.1 provides a more efficient way of computing links. For every point, after computing a list of its neighbours, the algorithm considers all pairs of its neighbours. Thus, the complexity of the algorithm is $\sum_i m_i^2$, which is $O(n\, m_m m_a)$, where $m_a$ and $m_m$ are the average and maximum number of neighbours for a point, respectively. In the worst case, the value of $m_m$ can be n, in which case the complexity of the algorithm becomes $O(m_a n^2)$.

3. PROBLEM DEFINITION

The main problem in modeling an efficient information retrieval process is that a user cannot express his information need straightforwardly in a query posted to an information repository. A user's query represents just an approximation of his information need. Consequently, a user's query should be refined in order to ensure that it retrieves as many relevant results as possible. Unfortunately, most information retrieval systems do not provide co-operative support in the query refinement process, so that a user is forced to change his query on his own in order to find the most suitable results. Indeed, although in an interactive query refinement process a user is provided with a list of terms that appear frequently in retrieved documents, an explanation of their impact on the information retrieval process is completely missing.

4. TASKS TO PERFORM

To accomplish this goal we should perform the following tasks:
1. Pre-processing
2. Normalization
3. Latent Semantic Indexing based on SVD
4. Clustering (ROCK algorithm)

4.1 Pre-processing

Pre-processing involves the reformatting, or filtering, of a text document to facilitate meaningful statistical analysis.


Table 4.1.1. (a) Document = vc5.txt

Word   Token or Term   Frequency
0      www             4
1      tim             4
2      berners         4
3      lee             4
4      internet        1
5      cern            1
6      web             1

Table 4.1.2 (b) Document = vc6.txt

Word   Token or Term   Frequency
0      internet        1
1      role            1
2      tim             1
3      berners         1
4      lee             1
5      cern            4
6      server          4
7      web             5

Table 4.1.3 (c) Document = vc7.txt

Word   Token or Term   Frequency
0      network         6
1      security        6
2      internet        4
3      web             4
4      applications    4
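A minimal Python sketch of such a pre-processing step is given below. The tokenization rule, the stop-word list and the sample string (modelled on vc5.txt) are assumptions made for illustration; the filtering rules actually used are not specified in this paper.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list actually used is not given in the paper.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "for", "in", "to"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into alphanumeric tokens, drop stop words, count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Sample text modelled on document vc5.txt (an assumption, not the actual file).
sample = ("www and tim berners lee " * 4) + "internet CERN web"
frequencies = term_frequencies(sample)
```

For this sample string the counts agree with Table 4.1.1 (a): four occurrences each of www, tim, berners and lee, and one each of internet, cern and web.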

4.2 Normalization

Normalization is the process in which we calculate the normalized weight of each word obtained from pre-processing. We first calculate the weight of each term by the following formula:

Fig 4.2.1

where $W_{i,k}$ is the weight of the ith term in the kth document and $n_k$ is the total number of terms in that document. This weight corresponds to the term weight within a single document. However, the term weight should be normalized for the entire set of documents, not for only one document. This is called normalizing the term weight. In the next step of normalization, the weights for each individual document are combined into a normalized analysis of the whole collection of documents. To do this we must take into account the fact that a term may have a large weight simply because the document in which it occurs is small, rather than because it occurs frequently throughout the document collection. To eliminate this problem, the normalized weight of a term is calculated as

Fig 4.2.2

This process is a fairly standard normalization for document length, as explained by Greengrass [5]. The normalized outputs for the words occurring more than twice in the tested Washington Post documents are shown in the tables below.

Table 4.2.1(a) Word - Weight

Row   Term           Document1   Document2   Document3
0     www            0.211       0.000       0.000
1     tim            0.211       0.056       0.000
2     berners        0.211       0.056       0.000
3     lee            0.211       0.056       0.000
4     internet       0.053       0.056       0.167
5     cern           0.053       0.222       0.000
6     web            0.053       0.278       0.167
7     role           0.000       0.056       0.000
8     server         0.000       0.222       0.000
9     network        0.000       0.000       0.250
10    security       0.000       0.000       0.250
11    applications   0.000       0.000       0.167


Table 4.2.2(b)

Row   Term           Document1   Document2   Document3
0     www            0.489       0.000       0.000
1     tim            0.489       0.127       0.000
2     berners        0.489       0.127       0.000
3     lee            0.489       0.127       0.000
4     internet       0.122       0.127       0.365
5     cern           0.122       0.508       0.000
6     web            0.122       0.635       0.365
7     role           0.000       0.127       0.000
8     server         0.000       0.508       0.000
9     network        0.000       0.000       0.548
10    security       0.000       0.000       0.548
11    applications   0.000       0.000       0.365
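The two normalization steps can be sketched in Python as follows. The formulas themselves are an assumption: dividing each term frequency by the total number of terms in the document, and then dividing each weight by the Euclidean length of the document's weight vector, reproduces the values of Tables 4.2.1(a) and 4.2.2(b), but the paper's own formulas are not restated here. The function names are hypothetical.

```python
import math


def term_weights(frequencies):
    """W[i,k] = frequency of term i in document k / total number of terms n_k."""
    n_k = sum(frequencies.values())
    return {term: count / n_k for term, count in frequencies.items()}


def normalize(weights):
    """Divide every weight by the Euclidean length of the document's weight vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / length for term, w in weights.items()}


# Term frequencies of vc5.txt and vc6.txt, taken from Tables 4.1.1 (a) and 4.1.2 (b).
vc5 = {"www": 4, "tim": 4, "berners": 4, "lee": 4, "internet": 1, "cern": 1, "web": 1}
vc6 = {"internet": 1, "role": 1, "tim": 1, "berners": 1, "lee": 1,
       "cern": 4, "server": 4, "web": 5}

w5, w6 = term_weights(vc5), term_weights(vc6)
nw5, nw6 = normalize(w5), normalize(w6)
```

Rounded to three decimals, `w5["www"]` is 0.211 and `nw5["www"]` is 0.489, matching the Document1 column of the two tables.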

4.3. Creating Term-Document matrix

At this stage we can represent a textual document as a set of meaningful normalized terms. The normalized term weights collectively form the matrix W, where $W_{i,k} = NormalizedWeight_{i,k}$. This matrix is referred to as the term-document matrix. A term-document matrix contains terms as rows and documents as columns. In this step, we create a term-document matrix that describes the occurrence of meaningful terms in each document of the collection. To create such a matrix, we keep track of the different meaningful terms that occur in different documents. While reading a new text document, new terms are added to the matrix as rows, and their frequencies are added in the respective columns. While reading a new document, the system should check whether its new terms are grammatical forms of previously seen terms (i.e. apply stemming). The output of this process is a term-document matrix containing the distinct, meaningful terms of the entire collection as rows, and the documents of the collection as columns.

4.4. Latent Semantic Indexing

At this stage we wish to determine a set of concepts from the term-document matrix, where a concept is defined as a set of related terms. We accomplish this by using a method called Latent Semantic Indexing (LSI), which primarily involves decomposing the matrix W using Singular Value Decomposition (SVD).

4.4.1. Singular Value Decomposition

SVD is a well-known matrix decomposition method. It decomposes the matrix A, i.e. the m x n term-document matrix with m terms and n documents, as $A = U S V^T$, where U is an m x r matrix, called the term matrix, $V^T$ is an r x n matrix, with V called the document matrix, and S is an r x r diagonal matrix containing the singular values of A on its diagonal in descending order. In this decomposition, the singular value $\sigma_i$ corresponds to the vector $u_i$, the ith column of U, and to $v_i$, the ith row of $V^T$. Without loss of generality we can assume that the columns of U, the rows of $V^T$, and the diagonal values of S have been arranged so that the singular values are in descending order, moving down the diagonal. The decomposition of matrix A is shown in Figure 4.4.1.
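As an illustration, the decomposition and its rank-s truncation can be computed with NumPy's `numpy.linalg.svd`. This is a sketch only: the function name `lsi_truncate` and the choice s = 2 are assumptions, and the input is the normalized term-document matrix of Table 4.2.2(b).

```python
import numpy as np


def lsi_truncate(A, s):
    """Return (U_s, S_s, Vt_s), keeping only the s largest singular values of A."""
    # numpy returns the singular values already sorted in descending order.
    U, sigma, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    return U[:, :s], np.diag(sigma[:s]), Vt[:s, :]


# Normalized term-document matrix from Table 4.2.2(b): 12 terms x 3 documents.
A = np.array([
    [0.489, 0.000, 0.000],  # www
    [0.489, 0.127, 0.000],  # tim
    [0.489, 0.127, 0.000],  # berners
    [0.489, 0.127, 0.000],  # lee
    [0.122, 0.127, 0.365],  # internet
    [0.122, 0.508, 0.000],  # cern
    [0.122, 0.635, 0.365],  # web
    [0.000, 0.127, 0.000],  # role
    [0.000, 0.508, 0.000],  # server
    [0.000, 0.000, 0.548],  # network
    [0.000, 0.000, 0.548],  # security
    [0.000, 0.000, 0.365],  # applications
])

U_s, S_s, Vt_s = lsi_truncate(A, s=2)   # s = 2 is an arbitrary illustrative choice
A_s = U_s @ S_s @ Vt_s                  # rank-2 approximation of A
```

Each column of `U_s` weights the terms of one concept, and each column of `Vt_s` gives the coordinates of one document in the reduced concept space. Keeping all three singular values reproduces A up to rounding error.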

Fig 4.4.1 SVD

In LSI, we form a new matrix $A_s = U_s S_s V_s^T$ by retaining only the s largest singular values. Detailed information about SVD can be obtained from the SVD package of Berry [1] [2] and from Spotting Topics with the SVD by Charles Nicholas and Randall Dahlberg [6].

5. CONSTRUCTING DOCUMENT ONTOLOGY

Constructing the document ontology is essentially building concept nodes and term nodes from the term matrix (U) and the document matrix (V), which we have obtained from SVD. A concept node represents a concept and contains information about its concept name, the terms that belong to that concept, and their weights in that concept. The name of a concept is generated automatically and is a hyphenated string of the five most frequent terms in that concept. Each column in the term matrix (U) corresponds to a concept node. A term node represents a term and contains information about its term name, the concepts that it belongs to, and its weight in the different concepts. The name of a term is generated automatically and is simply the term itself. Each row in the term matrix (U) corresponds to a term node.

6. CLUSTERING

The following steps are carried out to implement the clustering algorithm.

Step-1 Create the initial clusters for the given documents.

Table 6 (a)
Cluster-Name : CLuster1 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [, www, tim, berners, lee, lee, internet, CERN, web]
Documents in the cluster are : i = 0 doc1

Table 6 (b)
Cluster-Name : CLuster2 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [, internet, role, tim, berners, lee, CERN, CERN, web, server, server]
Documents in the cluster are : i = 0 doc2

Table 6 (c)
Cluster-Name : CLuster3 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [network, security, security, internet, web, applications, applications, ]
Documents in the cluster are : i = 0 doc3

Table 6 (d)
Cluster-Name : CLuster4 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [, GSM, technology, related, mobile, communication, cellular, systems, growing, for, satellite]
Documents in the cluster are : i = 0 doc4

Table 6 (e)
Cluster-Name : CLuster5 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [, cellular, network, or, mobile, radio, distributed, over, land, areas, called, cells, radio, network, cellular]
Documents in the cluster are : i = 0 doc5

Table 6 (f)
Cluster-Name : CLuster6 and No-Of-Documents Ni : 1
Keyword in The Cluster Are : [frequencies, reused, other, cells, provided, that, same, not, adjacent, neighboring, as, would, cause, co, channel, interference, interference, ]
Documents in the cluster are : i = 0 doc6

Step-2 After creating the clusters, calculate the adjacency matrix for the given documents, as specified in the ROCK algorithm.

Table 6 (g) Initial adjacency_matrix Is

1 1 1 0 0 0
1 1 1 0 0 0
1 1 1 0 1 0
0 0 0 1 0 0
0 0 1 0 1 0
0 0 0 0 0 1

Step-3 Next, calculate the initial links matrix for the given documents, as specified in the ROCK algorithm.

Table 6 (h) Initial link_matrix Is

3 3 0 1 0 0
3 4 0 2 0 0
0 0 1 0 0 0
1 2 0 2 0 0
0 0 0 0 1 0
0 0 0 0 0 0

Step-4 Next, calculate the goodness measure from the initial links matrix for the given documents and arrange the values in ascending order.

Table 6 (i)
For Cluster 0
Goodness Between (C0,C1) = 0.6161829393253857
Goodness Between (C0,C2) = 0.6161829393253857
Goodness Between (C0,C3) = 0.0
Goodness Between (C0,C4) = 0.0
Goodness Between (C0,C5) = 0.0

Table 6 (j)
For Cluster 1
Goodness Between (C1,C0) = 0.6161829393253857
Goodness Between (C1,C2) = 0.6161829393253857
Goodness Between (C1,C3) = 0.0
Goodness Between (C1,C4) = 0.0
Goodness Between (C1,C5) = 0.0

Table 6 (k)
For Cluster 2
Goodness Between (C2,C0) = 0.6161829393253857
Goodness Between (C2,C1) = 0.6161829393253857
Goodness Between (C2,C3) = 0.0
Goodness Between (C2,C4) = 0.6161829393253857
Goodness Between (C2,C5) = 0.0

Table 6 (l)
For Cluster 3
Goodness Between (C3,C0) = 0.0
Goodness Between (C3,C1) = 0.0
Goodness Between (C3,C2) = 0.0
Goodness Between (C3,C4) = 0.0
Goodness Between (C3,C5) = 0.0

Table 6 (m)
For Cluster 4
Goodness Between (C4,C0) = 0.0
Goodness Between (C4,C1) = 0.0
Goodness Between (C4,C2) = 0.6161829393253857
Goodness Between (C4,C3) = 0.0
Goodness Between (C4,C5) = 0.0

Table 6 (n)
For Cluster 5
Goodness Between (C5,C0) = 0.0
Goodness Between (C5,C1) = 0.0
Goodness Between (C5,C2) = 0.0
Goodness Between (C5,C3) = 0.0
Goodness Between (C5,C4) = 0.0

Step-5 The next step is merging the clusters.

Table 6 (o)
index1 = 0 index2 = 1
Merging clusters
cluster 1 = CLuster1 index = 0
cluster 2 = CLuster2 index = 1
Cluster Removed CLuster2

Step-6 Fire the query and retrieve the appropriate cluster.

Table 6 (p)
Enter your query or keywords: Tim
Cluster Matched With Query Is

Table 6 (q)
CLUSTER-NAME : CLuster1 and No-Of-Documents Ni : 2
Keyword in The Cluster Are : [, tim, berners, lee, lee, internet, CERN, CERN, web]
Documents in the cluster are :
doc1 - a www and timberners lee www and timberners lee www and timberners lee www and timberners lee internet CERN web a a
doc2 - a internet and role of timberners lee . CERNCERNCERN web server web server web server web server web server
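The retrieval in Step-6 can be sketched in Python as a keyword lookup over the clusters. The matching rule is an assumption, since it is not specified in the paper: here a cluster matches when at least one query term appears among its keywords, ignoring case. The function name `match_clusters` and the reduced keyword lists are hypothetical.

```python
def match_clusters(query, clusters):
    """Return the names of all clusters sharing at least one keyword with the query."""
    query_terms = {term.lower() for term in query.split()}
    return [name for name, keywords in clusters.items()
            if query_terms & {keyword.lower() for keyword in keywords}]


# Keyword lists modelled on Table 6 (q) and Table 6 (c), with duplicates removed.
clusters = {
    "CLuster1": ["tim", "berners", "lee", "internet", "CERN", "web"],
    "CLuster3": ["network", "security", "internet", "web", "applications"],
}
matched = match_clusters("Tim", clusters)
```

For the query Tim only CLuster1 is returned, as in Table 6 (q); a query such as internet would match both clusters under this rule.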


7. CONCLUSION

In this paper, a new concept is proposed that makes use of links to measure the similarity / proximity between pairs of data points with categorical attributes. A robust hierarchical clustering algorithm, ROCK, is used along with SVD; it employs links and not distances for merging clusters. This method extends to metric similarity measures that are relevant in situations where a domain expert / similarity table is the only source of knowledge. For better results, other clustering algorithms can be explored to further improve the performance of the information retrieval system.

8. ACKNOWLEDGEMENT

Our thanks go to the experts, authors and referenced journals that have contributed towards the development of this paper and helped us make the concepts clear.

9. REFERENCES

1. Berry, M. W., Dumais, S. T., O'Brien, G. W. (December 1995). Using Linear Algebra for Intelligent Information Retrieval.
2. Berry, M. W., Do, T., O'Brien, Krishna, V., Varadhan, S., SVDPACKC: Version 1.0 User's Guide, Tech. Report CS-93-194, University of Tennessee, Knoxville, TN, October 1993.
3. Dr. Sadanand Srivastava, Dr. James Gill de Lamadrid, Yuri Karakshyan, Document Ontology Extractor, CADIP 00.
4. Dr. Sadanand Srivastava, Dr. James Gill de Lamadrid, Yuri Karakshyan, Document Ontology from a Document using SVD.
5. Greengrass, E. (February 1997). Information Retrieval: An Overview.
6. Nicholas, C., Dahlberg, R. (March 1998). Spotting Topics with the Singular Value Decomposition, Principles of Digital Document Processing, FIARO, St. Malo.
7. A. Maedche, S. Staab, Ontology Learning for the Semantic Web, IEEE Intelligent Systems 16(2) (2001) 72-79.
8. Golub, G., Luk, F., Overton, M., A Block Lanczos Method for Computing the Singular Values and Corresponding Singular Vectors of a Matrix, ACM Transactions on Mathematical Software, 7(2), pp. 149-169, 1981.
9. SVD and LSI tutorial 1 -to-calculations.html.
10. Vinu P.V., Sherimon P.C. and Reshmy Krishnan, Development of Seafood Ontology for Semantically Enhanced Information Retrieval, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 154 - 162, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
11. Meghana N. Ingole, M.S. Bewoor and S.H. Patil, Context Sensitive Text Summarization using Hierarchical Clustering Algorithm, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322 - 329, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
12. Prakasha S, Shashidhar Hr and Dr. G T Raju, A Survey on Various Architectures, Models and Methodologies for Information Retrieval, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print: 0976-6367, ISSN Online: 0976-6375.


