
SIGNIFICANCE OF HTML TAGS FOR DOCUMENT INDEXING AND RETRIEVAL

Byurhan Hyusein and Ahmed Patel


Computer Networks and Distributed Systems Research Group, Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland

ABSTRACT

Indexing quality has an overwhelming effect on the retrieval effectiveness of search engines. In the past few years it has become one of the major challenges in the search engine area; in particular, the task of automatically assigning high-quality terms to Web documents remains elusive. High indexing and retrieval quality requires work on term selection algorithms. This paper investigates the feasibility of using HTML tags to represent the contents of Web documents. Experiments were performed on the WT10g collection, a 1.69-million page corpus, using many different combinations of term selection and information retrieval algorithms.

KEYWORDS

Information retrieval, term selection, tags, search engines, HTML.

1. INTRODUCTION
A useful measure of term significance in a text document is the term frequency (TF) of its occurrences (Luhn, 1958; Salton, 1989). Typically, the process of selecting terms for indexing is performed in three steps:

1. Apply a stop-word and stemming algorithm, i.e. remove the common words (e.g. and, or, etc.) from the text of the document and extract the root of each remaining word.
2. Compute the term frequency of all remaining terms in each document.
3. Select the N terms with the highest term frequency from each document as its index vector.

Another term selection method involves the inverse document frequency (IDF). After the weights of the terms in the documents have been calculated using TF-IDF weighting functions, the top N terms with the highest weights over all the documents are chosen to form the index vectors. Since the term weights depend on the inverse document frequency, this method may extract terms with a relatively low document frequency, because a lower document frequency gives a term a higher weight. Furthermore, there is no guarantee that these terms cover a high percentage of the documents, and consequently many documents will be represented by an index vector with zero weight in all entries. To prevent this, Wong and Fu (2000) proposed an algorithm based on the coverage of the terms, where the coverage of a term is defined as the percentage of documents in the collection containing that term. Their term selection algorithm consists of the following steps (a sketch is given after the list):

1. Randomly select a subset of documents of size M from the collection of documents.
2. Extract the set of terms that appear at least once in these documents.
3. Remove stop words and apply a stemming algorithm.
4. Count the document frequency of the terms remaining after stop-word removal and stemming.
5. Set lower = k and upper = k.
6. Select all terms with document frequency in the range from lower to upper.
7. Check whether the coverage of these terms is larger than a pre-defined threshold. If so, stop; if not, set lower = lower - 1 and upper = upper + 1 and go to the previous step.

The dependency on the number and the contents of the selected documents is a weakness of this method.
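As an illustration only, the following Python sketch shows one possible reading of the coverage-based selection steps above. The function and parameter names (coverage_term_selection, sample_size, threshold, the stem callable) are ours, not Wong and Fu's, and a real implementation would plug in a Porter stemmer and proper stop-word lists.

```python
import random
from collections import Counter

def coverage_term_selection(collection, sample_size, k, threshold, stopwords, stem):
    """Sketch of the coverage-based term selection of Wong and Fu (2000).

    collection : list of documents, each a list of raw word tokens
    sample_size: M, the number of documents sampled from the collection
    k          : initial document-frequency value for the selection window
    threshold  : required coverage, i.e. the fraction of sampled documents
                 that must contain at least one selected term
    stopwords  : set of common words to drop
    stem       : callable mapping a word to its root (e.g. a Porter stemmer)
    """
    # 1. Randomly select a subset of M documents.
    sample = random.sample(collection, sample_size)

    # 2-4. Stem, remove stop words, and count the document frequency of each term.
    processed = [{stem(w) for w in doc if w.lower() not in stopwords}
                 for doc in sample]
    doc_freq = Counter(t for terms in processed for t in terms)

    # 5. Start with a window containing only terms of document frequency k.
    lower = upper = k
    while True:
        # 6. Select all terms whose document frequency lies in [lower, upper].
        selected = {t for t, df in doc_freq.items() if lower <= df <= upper}

        # 7. Stop when the selected terms cover enough of the sample;
        #    otherwise widen the window and try again.
        covered = sum(1 for terms in processed if terms & selected)
        if covered / len(processed) >= threshold or (lower <= 1 and upper >= len(processed)):
            return selected
        lower, upper = max(1, lower - 1), upper + 1
```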


In search engines (SEs) the indexing scheme mainly depends on the type of search which the engine supports. For example, in order to support phrase search, SEs such as HotBot, OpenText, Excite, AltaVista, InfoSeek Guide, etc. employ full-text indexing (Gudivada et al, 1997). This kind of indexing may generate large indexes and require powerful computers for rapid searching. To increase search speed, InfoSeek Guide and HotBot use several computers and distributed indexes. ADSA (Khoussainov et al, 2001), Lycos and the World Wide Web Worm use the HTML tags of Web documents to minimize the index size. In Lycos, the text in the title, headings, subheadings and the first few lines of the beginning of a Web document forms the document index; if the number of unique terms in a document exceeds 100, only the 100 most heavily weighted terms are kept. The World Wide Web Worm employs a slightly different technique: its indexes are constructed from the title and anchor text plus the URLs of the Web documents. However, indexes based on just these three HTML elements are unlikely to be entirely satisfactory, because it has been estimated that 20 percent of the HTML documents on the Web have no title. Term weighting and document ranking algorithms depend heavily on the type of indexing method used, TF-IDF weighting functions being the most popular choice. However, other kinds of weighting functions are also possible. For example, Lycos uses a TF-IDF weighting function, but terms which appear in the title and/or at the beginning of a Web document are given more weight because they are assumed to have higher significance and value. The retrieval status value (RSV) of a document is then computed as the sum of the weights of the document terms matching the query, and the documents are sorted in decreasing RSV order. Our investigation focused on empirically evaluating the feasibility of using key Web page elements as proxies for page contents, as explained in the next section. We also compared text retrieval ranging from partial to full-text indexing. This article describes the methods we used and summarizes our results.
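To make this kind of weighting and ranking concrete, the sketch below shows a generic TF-IDF weighting with an extra boost for title or leading-text terms and an RSV computed as the sum of matched term weights. The boost factor and the particular IDF formula are illustrative assumptions on our part; they are not the actual functions used by Lycos or any other engine mentioned above.

```python
import math
from collections import Counter

def tf_idf_weights(doc_terms, boosted_terms, doc_freq, n_docs, boost=2.0):
    """Assign TF-IDF weights to the terms of one document.

    doc_terms    : list of (stemmed) terms of the document
    boosted_terms: terms found in the title or near the beginning of the page
    doc_freq     : dict mapping term -> number of documents containing it
    boost        : illustrative multiplier for title/leading-text terms
    """
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))  # one common IDF variant
        weight = freq * idf
        if term in boosted_terms:  # title / leading-text terms are given more weight
            weight *= boost
        weights[term] = weight
    return weights

def rsv(query_terms, doc_weights):
    """Retrieval status value: the sum of the weights of the document terms
    that match the query; documents are then sorted in decreasing RSV order."""
    return sum(doc_weights.get(term, 0.0) for term in set(query_terms))
```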

2. INDEX CONSTRUCTION
We have examined numerous Web pages to determine whether, consciously or otherwise, authors use HTML tags to indicate the subject matter of their documents. The following Web page elements were studied: title text (the text which appears between <TITLE> and </TITLE>); anchor text (the text which appears between <A ...> and </A>); emphasized text (in our case, only the text in the <I> </I> and <EM> </EM> tags of the HTML documents); and all terms with frequency greater than one which appear anywhere in the HTML document. Compared with the World Wide Web Worm indexing scheme, we added the emphasized text and the terms with frequency greater than one. Emphasized text was included because it can convey more meaning than the terms alone and is perceived to call attention to the text (Carliner, 2003). Indexing the terms with frequency greater than one adds to the index size and helps ensure (but does not guarantee) that the index will not be empty. Other HTML tags could also contribute to the index quality; unfortunately, our current indexer does not support them. More detailed information about HTML tags can be found on the official Web site of the World Wide Web Consortium (http://www.w3.org/). Our research aimed to study whether the indexed parts represent the contents of the Web documents, and if so, how they contribute to the average precision of the retrieved results, as explained below.
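As a rough sketch of how these elements can be collected, the following fragment uses Python's standard html.parser module to accumulate title, anchor and emphasized text together with overall term frequencies. It assumes reasonably well-nested HTML and ignores details such as stop-word removal and stemming; it is not our actual indexer.

```python
from collections import Counter
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the page elements studied here: title text, anchor text,
    emphasized text (<I> and <EM>), and all terms whose frequency in the
    whole document is greater than one."""

    TRACKED = {"title", "a", "i", "em"}

    def __init__(self):
        super().__init__()
        self.open_tags = []                      # stack of currently open tracked tags
        self.text = {t: [] for t in self.TRACKED}
        self.all_terms = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # assumes well-nested HTML: close the most recently opened tracked tag
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        words = data.lower().split()
        self.all_terms.update(words)
        for tag in self.open_tags:
            self.text[tag].extend(words)

    def index_sets(self):
        return {
            "T": set(self.text["title"]),
            "A": set(self.text["a"]),
            "E": set(self.text["i"]) | set(self.text["em"]),
            "AllG1": {t for t, f in self.all_terms.items() if f > 1},
        }

# Usage: parser = TagTextExtractor(); parser.feed(html_source)
# term_sets = parser.index_sets()
```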

3. EXPERIMENTS AND RESULTS


In our experiments we used the 10-gigabyte collection of documents (WT10g) (Bailey et al, 2002) available to participants in the Web track main task of the TREC-9 experiments (Voorhees and Harman, 2000; Hawking, 2000). This collection is a subset of the 100-gigabyte VLC2 collection created by the Internet Archive in 1997.


Consider the following sets:

All - all terms which appear in the document;
AllG1 - all terms in the document with frequency greater than one;
T - title text of the HTML document (the text which appears between <TITLE> and </TITLE>);
A - anchor text of the HTML document (the text which appears between <A ...> and </A>);
E - emphasized text of the HTML document (in our case, only the text in the <I> </I> and <EM> </EM> tags).

Experiments were performed using the following indexing schemes: All, AllG1, AllG1_T, AllG1_A, AllG1_E, AllG1_T_A, AllG1_T_E, AllG1_A_E, AllG1_T_A_E, T, A, E, T_A, T_E, A_E, and T_A_E. In this notation, AllG1_T_A_E, for example, means that the index includes all terms in the document with term frequency greater than one, plus all terms which appear in the T, A and E sets (a small sketch of this combination is given after Table 1). We used only the title part of Web track topics 451-500. Stop-word lists of 595 English, 172 German and 352 Spanish words were used. Porter's stemming algorithm (Porter, 1980) was applied to both documents and queries. All terms in the query were weighted equally by one. Terms in the documents were weighted using different weighting functions: PIVOT (Singhal et al, 1996), SMART (Salton and Buckley, 1988), INQUERY (Broglio et al, 1994) and ADSA. The inner product was used as the similarity function in the experiments. Table 1 shows, for each indexing scheme, the average index size, the number of relevant documents retrieved and the average precision obtained with the PIVOT (P), SMART (S), INQUERY (I) and ADSA (A) weighting functions. The results were evaluated using the trec_eval package written by Chris Buckley of Sabir Research (available at ftp://ftp.cs.cornell.edu/pub/smart/).
Table 1. Results obtained using different indexing schemes and weighting algorithms
(P = PIVOT, S = SMART, I = INQUERY, A = ADSA weighting functions)

Type of indexing | Average index size | Relevant retrieved (P / S / I / A) | Average precision (P / S / I / A)
All              | 132.548            | 1260 / 1220 / 1137 / 1223          | 0.1591 / 0.1368 / 0.1126 / 0.1673
AllG1_T_A_E      |  60.590            | 1028 /  997 /  940 / 1005          | 0.1288 / 0.1115 / 0.0987 / 0.1419
AllG1_A_E        |  59.845            | 1026 /  996 /  941 / 1007          | 0.1300 / 0.1125 / 0.0996 / 0.1419
AllG1_T_A        |  57.735            | 1019 /  989 /  932 /  997          | 0.1280 / 0.1113 / 0.0992 / 0.1413
AllG1_A          |  56.989            | 1022 /  986 /  930 /  998          | 0.1292 / 0.1122 / 0.1001 / 0.1412
AllG1_T_E        |  47.749            |  989 /  964 /  907 /  977          | 0.1256 / 0.1120 / 0.1001 / 0.1391
AllG1_E          |  46.982            |  994 /  968 /  908 /  975          | 0.1270 / 0.1132 / 0.1010 / 0.1390
AllG1_T          |  44.270            |  980 /  961 /  900 /  965          | 0.1252 / 0.1126 / 0.1011 / 0.1388
AllG1            |  43.503            |  987 /  966 /  906 /  963          | 0.1265 / 0.1137 / 0.1021 / 0.1387
T_A_E            |  30.016            |  456 /  438 /  406 /  419          | 0.0479 / 0.0476 / 0.0404 / 0.0514
A_E              |  27.599            |  420 /  408 /  384 /  405          | 0.0442 / 0.0387 / 0.0322 / 0.0514
T_A              |  25.463            |  414 /  401 /  374 /  351          | 0.0453 / 0.0436 / 0.0369 / 0.0501
A                |  22.852            |  359 /  348 /  326 /  437          | 0.0420 / 0.0368 / 0.0310 / 0.0407
T_E              |   9.633            |  270 /  272 /  266 /  391          | 0.0510 / 0.0488 / 0.0480 / 0.0393
E                |   6.283            |  251 /  231 /  227 /  243          | 0.0400 / 0.0330 / 0.0290 / 0.0372
T                |   3.714            |  208 /  212 /  212 /  206          | 0.0243 / 0.0244 / 0.0244 / 0.0244
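For clarity, the sketch below shows how an indexing-scheme name in the notation above can be turned into a per-document index as the union of the corresponding term sets. The names build_index and term_sets are illustrative; term weighting is applied afterwards by the chosen weighting function.

```python
def build_index(term_sets, scheme):
    """Combine per-document term sets according to an indexing-scheme name,
    e.g. "AllG1_T_A_E" yields AllG1 | T | A | E.  term_sets maps a set name
    ("All", "AllG1", "T", "A", "E") to the corresponding set of terms."""
    index = set()
    for part in scheme.split("_"):
        index |= term_sets[part]
    return index

# Example with hypothetical per-document sets:
# term_sets = {"AllG1": {"web", "index"}, "T": {"retrieval"}, "A": {"html"}, "E": {"tags"}}
# build_index(term_sets, "AllG1_T_A_E")  ->  {"web", "index", "retrieval", "html", "tags"}
```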

3.1 Discussion of results


Our results show that full-text indexing leads to a large average index size (132.548 terms in our case) but gives high precision and a large number of relevant retrieved documents. Small indexes were produced using only the title, anchor and emphasized text, but retrieval suffered, with lower precision and fewer retrieved documents. Indexing the terms with frequency greater than one contributes to better precision and a higher number of retrieved documents. Anchor and emphasized text also contribute to a higher number of retrieved documents. The weighting functions, however, have a great impact on the retrieved results. For instance, when anchor and emphasized text were used in index construction, the precision obtained with PIVOT and ADSA improved, while for SMART and INQUERY the average precision was slightly diminished in some of the cases where the terms with frequency greater than one were also used in index construction. Surprisingly, title text did not contribute to the number of retrieved documents or to the average precision.


4. CONCLUSION
In this paper we have presented how the different parts of Web documents contribute to the average precision in the search process. The anchor and emphasized text were helpful in indicating the contents of the Web documents. The terms with frequency greater than one are very useful to index and improve the precision of the retrieved results. Our research suggests that anchor and emphasized text can help improve the ranking of retrieved pages in more sophisticated SEs or meta-SEs. The anchor text can also be very beneficial for focused Web crawling. Other Web page elements that may potentially be used for indexing include section and table headings, table captions, image descriptions, surrounding paragraphs and so on. More research is also required to explore the use of different weighting and similarity functions, because one indexing scheme might work better in one case and not in another. Currently we are concentrating on the development of a more complex indexer which will allow us to index any attribute of a Web document. This will permit us to investigate in more detail how the different parts of the documents contribute to the average precision in the search process.

ACKNOWLEDGEMENT
The ADSA research project is funded by Enterprise Ireland under the Informatics Research Initiative.

REFERENCES
Bailey, P. et al, 2002. Engineering a Multi-Purpose Test Collection for Web Retrieval Experiments. In Information Processing and Management.
Broglio, J. et al, 1994. Document Retrieval and Routing Using the INQUERY System. Proceedings of the Third Text REtrieval Conference (TREC-3). Gaithersburg, Maryland, USA, pp 29-39.
Carliner, S., 2003. Five Tips for Type in Online Learning. In Learning Circuits. Available from: <http://www.learningcircuits.org/2003/jan2003/elearn.html> [Accessed July 2nd, 2003].
Gudivada, V. et al, 1997. Information Retrieval on the World Wide Web. In IEEE Internet Computing, Vol. 1, No. 5, pp 58-69.
Hawking, D., 2000. Overview of the TREC-9 Web Track. Proceedings of the 9th Text REtrieval Conference (TREC-9). Gaithersburg, Maryland, USA, pp 87-103.
Khoussainov, R. et al, 2001. Adaptive Distributed Search and Advertising for WWW. Proceedings of the Fifth World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001). Orlando, Florida, USA, Vol. 5, pp 73-78.
Luhn, H. P., 1958. The Automatic Creation of Literature Abstracts. In IBM Journal of Research and Development, Vol. 2, pp 159-165.
Porter, M. F., 1980. An Algorithm for Suffix Stripping. In Program, Vol. 14, No. 3, pp 130-137.
Salton, G. and Buckley, C., 1988. Term-Weighting Approaches in Automatic Text Retrieval. In Information Processing and Management, Vol. 24, No. 5, pp 513-523.
Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Publishers.
Singhal, A. et al, 1996. Pivoted Document Length Normalization. Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA, pp 21-29.
Voorhees, E. and Harman, D., 2000. Overview of the Ninth Text REtrieval Conference (TREC-9). Proceedings of the 9th Text REtrieval Conference (TREC-9). Gaithersburg, Maryland, USA, pp 1-15.
Wong, W. and Fu, A., 2000. Incremental Document Clustering for Web Page Classification. Proceedings of the 2000 International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges (IS2000). Aizu-Wakamatsu City, Fukushima, Japan.
