Topic-Oriented Query Expansion for Web Search
Shao-Chi Wang* and Yuzuru Tanaka
Meme Media Laboratory, Graduate School of Information Science and Technology, Hokkaido University, Japan
*The first author is now in Yahoo Japan Corporation
23-26 May, 2006
Categories & Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information search and retrieval, clustering, query formulation, retrieval models
Theory, Experimentation, Measurement
query expansion, topic-oriented, information bottleneck, term-term similarity matrix, intracluster similarity, intercluster similarity
1 IntroductionOne of the problems in selecting candidate terms for the query expansion strategy is the ambiguity of the original query term. In the current research, we suppose that each query term can be separated into multiple concepts, and the identical concepts of a term shall be grouped into an identical cluster. In each cluster, a bundle of terms will keep a unique topic that only relates to one concept of the original query. In the cluster to which the query term concepts of our concern belong, other terms can be selected as candidates of the query expansion. We call this strategy ``topic-oriented query expansion''. We employ the Information Bottleneck (IB) method for clustering purposes. Before performing the clustering technique, we first propose a new definition for the construction of a term-term similarity matrix. The current work is based on the problem of word sense disambiguation and its probabilistic solutions . Furthermore, we propose intracluster and intercluster similarity measures to evaluate the relevance of the topic of the cluster associated with the candidate terms respectively to the retrieved documents in the cluster itself, and to the documents in the other clusters. The result shows that the clusters obtained by our method have higher intracluster cohesiveness and lower intercluster cohesiveness.
2 Topic-oriented query expansion modelFigure 1 describes the outline of the topic-oriented query expansion model which consists of five steps. They are,
- (1) to collect pages that come from incoming links.
- (2) to extract extended anchor texts.
- (3) to construct a term-term similarity matrix.
- (4) to establish a topic-oriented cluster.
- (5) to rank terms.
The first two steps can be readily achieved. In the following sections, we will introduce the three remaining steps.
2.1 Constructing a term-term similarity matrixThe conventional term-term similarity matrix of reference was redefined as follows:
2.2 Establishing a topic-oriented clusterWe employ sIB (sequential Information Bottleneck) to cluster the extracted terms in the proposed model. The input information structure is the obtained term-term similarity matrix in Section 2.1.
2.3 Ranking termsBased on the clustering results, we propose a conditional entropy to rank terms as follows.
3 Experiment and evaluation
3.1 Description of experimentsHere, we report an example of searching Web documents with a query ``Web mining''. We picked up top 10 valid pages and set the maximum number of the incoming links that point to one core document to 100. Tables 1 and 2 show the top 10 terms corresponding to the conventional and the proposed definitions. Note that the number in the bracket after each query term denotes the number of its occurrences in each cluster.
3.2 Description of evaluationThe intracluster and intercluster similarities which are based on proximity between topics are described as follows.
A term vector (, C is the set of clusters) associated with the expanded query is defined by the equation (2) and a document vector () is defined by the equation (3). and denote the weights of the terms associated with the expanded query and the document, respectively. the set of the documents.
The average intracluster and intercluster similarities are defined as follows:
The larger intracluster similarity and the smaller intercluster similarity correspond to the higher cohesiveness of the cluster, namely the higher topic proximity of the retrieved documents in the cluster. The results are exhibited in Table 3. In comparison with the conventional definition, the average intracluster similarity of the proposed definition is improved for the gain of 79.1% whereas the average intercluster similarity is decreased for the loss of 36.0%. This shows that the new definition of the term-term similarity matrix results in significantly better performance than the conventional one in the topic-oriented query expansion model.
4 ConclusionsIn order to develop a topic-oriented query expansion strategy, we have proposed a new definition of the term-term similarity matrix to employ the IB clustering technique, and two measures to evaluate the relevance. Experimental results and evaluations have verified a significant improvement obtained by the proposed query expansion model.
5 AcknowledgmentsThe work presented in this paper has been supported by the 21st Century COE (Center of Excellence) Program of Japan Society of the Promotion of Science (JSPS).
B.-Y. Ricardo and R.-N. Berthier, Modern Information Retrieval, 1999
E.-N. Efthimiadis, Query expansion, 1996.
M. Sanderson, Word sense disambiguation and information retrieval, 1994.
N. Slonim et al., Unsupervised document classification using sequential information
N. Tishby et al., The information bottleneck method, 1999.
S. Hinrich, Automatic word sense discrimination, 1998.
T. Hofmann, Probabilistic latent semantic indexing, 1999.
V. Lavrenko and W. B. Croft, Relevance-based language models, 2001.