Improving Effectiveness and Efficiency of Web Search by Graph-based Text Representation

Improving Effectiveness and Efficiency of Web Search by Graph-based Text Representation

Junji Tomita and Yoshihiko Hayashi
NTT Cyber Space Laboratories, Japan
{j.tomita, y.hayashi}


Introduction

Improving the effectiveness and efficiency of text search has been a central issue in the field of information retrieval. A better algorithm that ranks relevant documents higher out of the huge volume of irrelevant ones is required, especially for improving the effectiveness of Web search. Furthermore, as the target information can rarely be obtained through single-shot queries, a better way of displaying search results is desired to improve the efficiency of Web search, because it would help the user to assess the relevance of the returned documents. This paper demonstrates how the newly introduced Graph-based Text Representation can improve both the effectiveness and efficiency of Web search.

Graph-based Text Representation

We proposed a graph-based text representation model[1], that can represent both documents and queries. Such a graph is called "Subject Graph"; an instance is given in Figure-1. In the graph, a node represents a term in the text, while a link denotes an association between the linked terms. Significance of the terms and the term-term associations are represented as weights assigned to them. The model does not restrict the weighting scheme employed, and we can adopt conventional term-statistics-based weighting schemes. However, it should be remarked here that the Graph-based model is possibly richer than the conventionally utilized Vector Space Model (VSM)[2] in which term-term associations are not incorporated.

Figure-1: A Subject Graph.

Improving Effectiveness: Subject Graph Matching

VSM represents both queries and documents as term vectors, and calculates the similarity based on the inner product or cosine measure. While VSM has been excessively used in many search engines, it is reported that the model is problematic in handling documents with several topics[3]. Such documents are frequently seen on the Web these days. In contrast to VSM, our Graph-based model calculates the similarity based on Subject Graph Matching. The similarity from the Graph-based model is more correct than that from VSM, because the term-term association is adequately incorporated into the similarity calculations. Details of the weighting scheme, the matching algorithm and the evaluation results are shown in [1].

Improving Efficiency: Digest Graph for Search Result

The output of search engines is usually formatted as page titles and mechanically generated excerpts. Such representation fails to well support user in assessing relevance. Textual summaries adequately generated in response to the query would be more preferable, however this approach has not been fully established. A better display could also be achieved via graphical representation. We applied our Graph-based model for generating a graphical digest representation of a searched document. The resulting graph, called "Digest Graph" is a subgraph of the Subject Graph for the entire document. The subgraph is generated on the fly in response to the current query. A search results page from our Web search engine based on the proposed method is depicted in Figure-2. A user can intuitively and immediately understand the subject of each document from the terms and the term-term associations in the graph.

Figure-2: A Search Results page with Digest Graphs

Conclusions and Future works

In this paper, we showed that our Graph-based text representation model could improve the effectiveness of Web search by Subject Graph Matching. This model also improves the efficiency of Web search by providing Digest Graphs to the user; the graphs are generated on the fly in response to the user query. The Digest Graph can be further utilized for relevance feedback by extracting feedback terms from it. To further enhance the performance of the model, a scheme to assign weights to terms and term-term associations is crucially important, and will be investigated.

References

  1. Junji Tomita and Hiroshi Takeno(1998), Proposal for a new IR system using subject graph and word's weighting by the relation (in Japanese), IPSJ-SIGFI, 52, 3, pp. 17-24
  2. G. Salton and A. Wong and C. S. Yang(1975), A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, pp. 613-620.
  3. Hearst, M. A. and Plaunt, C.(1993), Sub-topic Structuring for Full-Length Document Access, Proceedings of ACM-SIGIR'93, pp. 59-68.