Utilizing the Subjective Intent of Authoring Formats to Perform Focused Web Crawling

Hok Peng Leung and Wynne Hsu 
School of Computing
National University of Singapore 
Lower Kent Ridge Road, Singapore 119260 
{leunghp, whsu}@comp.nus.edu.sg

Introduction

A successful web information retrieval system must determine quickly and accurately whether a document or a link should be explored further. Many researchers have sought to improve the performance of such systems by exploiting the different kinds of information available in web documents. In this paper, we propose a fast and accurate approach to determining the relevance of a document by taking into account the information embedded within its HTML formatting tags. Using this information, we can quickly narrow the scope of the search to the most promising sites. In addition, we propose a new query formulation strategy to further improve the accuracy of the approach. We have conducted a number of experiments to test the effectiveness of the proposed approach and the crawling strategy. The experimental results indicate a significant improvement over the standard information retrieval algorithm based on tf*idf. Furthermore, unlike the tf*idf scheme, our algorithm does not require the whole document space to be known in advance, which makes it suitable for the web, where the entire document space cannot be known beforehand.

Our HTML-based Text-Emphasis cum Query Reformulation Approach

The motivation for HTML-Based Text Emphasis is to take the subjective intent of the author into consideration during retrieval by allocating different weights to the different HTML tags used to format the document. The algorithm is based on the idea of recursively assigning higher weights to terms that are enclosed within text-emphasis tags. In addition, query phrase formulation is used to process the query as a composition of phrases rather than as individual terms. For example, for a query Q="Natural Language Processing", a document that contains the string "Natural Language Processing" is more relevant than documents that contain only single words such as "Natural", "Language", or "Processing". In this strategy, we first generate all the possible phrases that can be formed from the query string. They include:
  1. (Natural Language Processing)
  2. (Natural) & (Language Processing)
  3. (Natural Language) & (Processing)
  4. (Natural) & (Language) & (Processing)
For each phrase, we attempt to match it to the document iteratively, starting from the longest phrase (case 1) and ending with the shortest phrases (case 4). The results are then weighted and averaged to give the overall similarity measure. Full details can be found in the full thesis.
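The phrase generation step above can be sketched in a few lines of Python. The sketch below is illustrative only: the function names `segmentations` and `phrase_score` are hypothetical, and the weighting (squared phrase length, then a plain average over the four cases) is an assumption, since the exact weights are not given here.

```python
import re
from itertools import combinations


def segmentations(query):
    """Return every split of the query into contiguous phrases, longest first."""
    words = query.split()
    n = len(words)
    results = []
    # Each subset of the n-1 gaps between words defines one segmentation.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            results.append(
                [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return results


def _normalise(text):
    """Lowercase the text and keep only word tokens, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def phrase_score(query, document):
    """Average, over all segmentations, the weighted share of phrases found.

    Assumption: a phrase of m words carries weight m**2, so longer matches
    count for more. The source does not state its actual weights.
    """
    doc = f" {_normalise(document)} "
    scores = []
    for seg in segmentations(query):
        sizes = [len(p.split()) ** 2 for p in seg]
        hits = [s for p, s in zip(seg, sizes) if f" {_normalise(p)} " in doc]
        scores.append(sum(hits) / sum(sizes))
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumed weights, a document containing the exact phrase "Natural Language Processing" scores higher than one in which the three words appear only separately or in a different order.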

Based on the new similarity measure, our crawler decides which page is the most relevant one at which to begin its search. Once a page has been selected, the crawler determines which of the many links that appear within that page is the most relevant to follow.
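As a rough illustration of how tag-based emphasis could drive link selection, the sketch below scores each link on a page by the query terms in its anchor text, multiplying the weight for every enclosing emphasis tag. Everything specific in it is an assumption made for illustration: the `EMPHASIS` multipliers, the `LinkScorer` class, the `best_link` helper, and the choice to score links by anchor text alone are not taken from the paper.

```python
import re
from html.parser import HTMLParser

# Hypothetical multipliers; the actual tag weights are not listed in this paper.
EMPHASIS = {
    "title": 4.0, "h1": 3.0, "h2": 2.5, "h3": 2.0,
    "b": 1.5, "strong": 1.5, "em": 1.5, "i": 1.5,
}


class LinkScorer(HTMLParser):
    """Accumulate an emphasis-weighted score for every link on a page."""

    def __init__(self, query_terms):
        super().__init__()
        self.query = {t.lower() for t in query_terms}
        self.stack = []    # emphasis tags currently open
        self.href = None   # target of the enclosing <a>, if any
        self.scores = {}   # href -> accumulated score

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            if self.href is not None:
                self.scores.setdefault(self.href, 0.0)
        elif tag in EMPHASIS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == "a":
            self.href = None
        elif tag in self.stack:
            # Pop back to the matching tag, tolerating unclosed inner tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.href is None:
            return
        # Nested emphasis multiplies, so deeper emphasis yields a higher weight.
        weight = 1.0
        for tag in self.stack:
            weight *= EMPHASIS[tag]
        for word in re.findall(r"\w+", data.lower()):
            if word in self.query:
                self.scores[self.href] += weight


def best_link(html, query_terms):
    """Return the highest-scoring href on the page, or None if it has no links."""
    parser = LinkScorer(query_terms)
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return None
    return max(parser.scores, key=parser.scores.get)
```

A crawler built along these lines would call `best_link` on the selected page and fetch the returned link next; a link whose anchor text sits inside a heading or bold tag is preferred over one in plain running text.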

Experiment Results

We have performed a number of experiments to measure the precision and recall of our proposed algorithm. These experiments are carried out on a database of 1,400 HTML pages. The pages are pre-classified into 14 categories. Each of these categories contains a minimum of 20 pages. The size of these pages ranges from a few kilobytes to a few megabytes.
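For readers unfamiliar with the two measures, the sketch below shows one common way to compute precision and recall for a ranked result list against a pre-classified category. The function names and the toy data in the usage note are hypothetical and are not taken from the experiments.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall of the top-k ranked documents."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_recall_curve(ranked, relevant):
    """List (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points
```

For example, with the ranking `["a", "x", "b", "y", "c"]` and relevant set `{"a", "b", "c"}`, the top three results contain two relevant pages, so both precision and recall at that cutoff are 2/3.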
Figure 1. Overall Performance Chart.
Figure 1 shows that the overall performance of HTML-Based Text Emphasis with Query Formulation is no worse than that of the standard tf*idf. In fact, Sim1 (HTML-Based Text Emphasis with Query Formulation) outperforms both Sim2 (HTML-Based Text Emphasis Only) and Sim3 (tf*idf) when the recall level is 0.7 or less.

Conclusions

Finding the right information on the web effectively and efficiently is a real and pressing need. In this paper, we propose a partial answer to this need. Experiments show that our approach performs no worse than the standard tf*idf overall and outperforms it at recall levels of 0.7 or less.