A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
Mehran Sahami
Google Inc.
Mountain View, CA, USA
sahami@google.com
Timothy D. Heilman
Google Inc.
Mountain View, CA, USA
tdh@google.com
Copyright is held by the World Wide Web Conference Committee (IW3C2).
Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.
ABSTRACT
Determining the similarity of short text snippets, such as search queries, is a task for which traditional document similarity measures (e.g., cosine) work poorly, since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) that leverages web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis
General Terms
Algorithms, Experimentation
Keywords
Text similarity measures, Web search, Information retrieval, Kernel functions, Query suggestion
1 Introduction
In analyzing text, there are many situations
in which we wish to determine how similar two short text snippets
are. For example, there may be different ways to describe some
concept or individual, such as ``United Nations
Secretary-General'' and ``Kofi Annan'', and we would like to
determine that there is a high degree of semantic
similarity between these two text snippets. Similarly, the
snippets ``AI'' and ``Artificial Intelligence'' are very similar
with regard to their meaning, even though they may not share any
actual terms in common. Directly applying traditional document
similarity measures, such as the widely used cosine coefficient
[17,16], to such short text
snippets often produces inadequate results, however. Indeed, in
both the examples given previously, applying the cosine would
yield a similarity of 0 since each given text pair contains no
common terms. Even in cases where two snippets may share terms,
they may be using the term in different contexts. Consider the
snippets ``graphical models'' and ``graphical interface''. The
first uses graphical in reference to graph structures
whereas the second uses the term to refer to graphic displays.
Thus, while the cosine score between these two snippets would be
0.5 due to the shared lexical term ``graphical'', at a semantic
level the use of this shared term is not truly an indication of
similarity between the snippets. To address this problem, we
would like to have a method for measuring the similarity between
such short text snippets that captures more of the semantic
context of the snippets rather than simply measuring their
term-wise similarity. To help us achieve this goal, we can
leverage the large volume of documents on the web to determine
greater context for a short text snippet. By examining documents
that contain the text snippet terms we can discover other
contextual terms that help to provide a greater context for the
original snippet and potentially resolve ambiguity in the use of
terms with multiple meanings. Our approach to this problem is
relatively simple, but surprisingly quite powerful. We simply
treat each snippet as a query to a web search engine in order to
find a number of documents that contain the terms in the original
snippets. We then use these returned documents to create a
context vector for the original snippet, where such a
context vector contains many words that tend to occur in context
with the original snippet (i.e., query) terms. Such context
vectors can now be much more robustly compared with a measure
such as the cosine to determine the similarity between the
original text snippets. Furthermore, since the cosine is a valid
kernel, using this function in conjunction with the generated
context vectors makes this similarity function applicable in any
kernel-based machine learning algorithm [4] where (short) text
data is being processed. While there are many cases where getting
a robust measure of similarity between short texts is important,
one particularly useful application in the context of search is
to suggest related queries to a user. In such an application, a
user who issues a query to a search engine may find it helpful to
be provided with a list of semantically related queries that he
or she may consider to further explore the related information
space. By employing our short text similarity kernel, we could
match the user's initial query against a large repository of
existing user queries to determine other similar queries to
suggest to the user. Thus, the results of the similarity function
can be directly employed in an end-user application. The approach
we take in constructing our similarity function has relations to
previous work in both the Information Retrieval and Machine
Learning communities. We explore these relations and put our work
in the context of previous research in Section 2. We then formally define
our similarity function in Section 3 and present initial examples of
its use in Section 4. This is followed by a
mathematical analysis of the similarity function in Section
5. Section
6 presents a
system for related query suggestion using our similarity
function, and an empirical evaluation of this system is given in
Section 7.
Finally, in Section 8
we provide some conclusions and directions for future work.
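As a concrete illustration of the term-overlap failures described above, the cosine coefficient can be computed directly on binary term vectors (a minimal sketch; the toy vectors and tokenization here are purely illustrative, not the representation used in our experiments):

```python
import math

def cosine(a, b):
    """Cosine coefficient between two bag-of-words term vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# No shared terms: cosine is 0 despite the semantic match.
print(cosine({"AI": 1}, {"Artificial": 1, "Intelligence": 1}))  # 0.0
# One shared term used in different senses: cosine is 0.5 anyway.
print(cosine({"graphical": 1, "models": 1},
             {"graphical": 1, "interface": 1}))  # 0.5
```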
2 Related Work
The similarity function we present here is
based on query expansion techniques [3,13] which have long
been used in the Information Retrieval community. Such methods
automatically augment a user query with additional terms based on
documents that are retrieved in response to the initial user
query or by using an available thesaurus. Our motivation for and
usage of query expansion greatly differs from this previous work,
however. First, the traditional goal of query expansion has been
to improve recall (potentially at the expense of precision) in a
retrieval task. Our focus, however, is on using such expansions
to provide a richer representation for a short text in order to
potentially compare it robustly with other short texts. Moreover,
traditional expansion is focused on creating a new query for
retrieval rather than doing pairwise comparisons between short
texts. Thus, the approach we take is quite different than the use
of query expansion in a standard Information Retrieval context.
Alternatively, information retrieval researchers have previously
proposed other means of determining query similarity. One early
method proposed by Raghavan and Sever [14] attempts to
measure the relatedness of two queries by determining differences
in the ordering of documents retrieved in response to the two
queries. This method requires a total ordering (ranking) of
documents over the whole collection for each query. Thus,
comparing the pairwise differences in rankings requires
O(N) time, where N is the number of documents in
the collection. In the context of the web, where N is measured in billions^{1}, this algorithm quickly becomes
intractable. Later work by Fitzpatrick and Dent [9] measures query
similarity using the normalized set overlap (intersection) of the
top 200 documents retrieved for each query. While this
algorithm's runtime complexity easily scales to the web, it will
likely not lead to very meaningful similarity results as the
sheer number of documents in the web collection will often make
the set overlap for returned results extremely small (or empty)
for many related queries that are not nearly identical. We show
that this is indeed the case in our experimental and theoretical
results later in the paper. In the context of Machine Learning,
there has been a great deal of work in using kernel methods, such
as Support Vector Machines, for text classification [11,8]. Such work has recently been
extended to building specialized kernels aimed at measuring
semantic similarity between documents. We outline some of these
approaches below, and show how they differ from the work
presented here. One of the early approaches in this vein is
Latent Semantic Kernels (LSK) [5],
which is a kernel-based extension to the well-known Latent
Semantic Indexing (LSI) [6]
proposed in the Information Retrieval community. In LSK, a kernel
matrix is computed over text documents, and the
eigendecomposition of this matrix is used to compute a new
(lower rank approximation) basis for the space. The dimensions of
the new basis can intuitively be thought of as capturing
``semantic concepts'' (i.e., roughly corresponding to covarying
subsets of the dimensions in the original space). While there may
be some superficial similarities, this approach differs in
fundamental respects from our work. First, our method is aimed at
constructing a new kernel function, not using an existing kernel
matrix to infer ``semantic dimensions''. Also, our method takes a
lazy approach in the sense that we need not compute an
expansion for a given text snippet until we want to evaluate the
kernel function. We never need to explicitly compute a full
kernel matrix over some set of existing text snippets nor its
eigendecomposition. Indeed, the kernel we present here is entirely
complementary to work on LSK, as our kernel could be used to
construct the kernel matrix on which the eigendecomposition is
performed. An approach more akin to that taken here is the work
of Kandola et al. [12] who define a
kernel for determining the similarity of individual terms based
on the collection of documents that these terms appear in. In
their work, they learn a Semantic Proximity Matrix that
captures the relatedness of individual terms by essentially
measuring the correlation in the documents that contain these
terms. In our work, the kernel we consider is not attempting to
just determine similarity between single terms, but entire text
snippets. Moreover, our approach does not require performing an
optimization over an entire collection of documents (as is
required in the previous work), but rather the kernel between
snippets can be computed online selectively, as needed. Previous
research has also tried to address learning a semantic
representation for a document by using cross-lingual techniques
[18].
Here, one starts with a corpus of document pairs, where each pair
is the same document written in two different languages. A
correlation analysis is then performed between the corpora in
each language to determine combinations of related words in one
language that correlate well with combinations of words in the
other language, and thereby learn word relations within a given
language. Obviously, the approach we take does not require such
paired corpora. And, again, we seek to not just learn
relationships between single terms but between entire arbitrary
short texts. Thus, while there has been a good deal of work in
determining semantic similarities between texts (which highlights
the general importance of this problem), much of it using kernel
methods, the approach we present differs significantly from
that work. Moreover, our approach provides the compelling
advantage that semantic similarity can be measured between
multi-term short texts, where the entire text can be considered
as a whole, rather than just determining similarity between
individual terms. Furthermore, no expensive preprocessing of a
corpus is required (e.g., eigendecomposition), and the kernel
can easily be computed for a given snippet pair as needed. We
simply require access to a search engine (i.e., text index) over
a corpus, which can be quite efficiently (linearly) constructed
or can be obviated entirely by accessing a public search engine
on the Web, such as the Google API
(http://www.google.com/apis).
3 A New Similarity Function
We now formalize our kernel function for semantic similarity. Let x represent a short text snippet^{2}. Now, we compute the query expansion of x, denoted QE(x), as follows:
1. Issue x as a query to a search engine S.
2. Let R(x) be the set of (at most) n retrieved documents d_1, d_2, ..., d_n.
3. Compute the TFIDF term vector v_i for each document d_i in R(x).
4. Truncate each vector v_i to include its m highest weighted terms.
5. Let C(x) be the centroid of the L2-normalized vectors v_i:
C(x) = (1/n) sum_{i=1}^{n} v_i / ||v_i||
6. Let QE(x) be the L2 normalization of the centroid C(x):
QE(x) = C(x) / ||C(x)||
We note that, to be precise, the computation of QE(x) really should be parameterized by both the query x and the search engine S used. Since we assume that S remains constant in all computations, we omit this parameter for brevity.

There are several modifications that can be made to the above procedure, as appropriate for different document collections. Foremost among these is the term weighting scheme used in Step 3. Here, we consider a TFIDF vector weighting scheme [15], where the weight w_{i,j} associated with term t_j in document d_i is defined to be:

w_{i,j} = tf_{i,j} x log(N / df_j),

where tf_{i,j} is the frequency of t_j in d_i, N is the total number of documents in the corpus, and df_j is the total number of documents that contain t_j. We compute N and df_j using a large sample of documents from the web. Clearly, other weighting schemes are possible, but we choose TFIDF here since it is commonly used in the IR community and we have found it to empirically give good results in building representative query expansions.

Also, in Step 4, we set the maximum number of terms in each vector to m = 50, as we have found this value to give a good tradeoff between representational robustness and efficiency. Furthermore, in Step 2, we need not use the entirety of the retrieved documents to produce vectors. We may instead create vectors using just the contextually descriptive text snippet that Web search engines commonly generate for each result. This makes our algorithm more efficient in terms of the amount of data processed, and allows us to make ready use of the results from public web search engines without having to retrieve the full underlying documents.

Of course, there remains the question of how large such descriptive texts provided by search engines need to be in order to be particularly useful. Empirically, we have found that using 1000 characters (in a token-delimited window centered on the original query terms in the original text) is sufficient to get accurate results, and increasing this number does not seem to provide much additional benefit. Evaluating a variety of term weighting or text windowing schemes, however, is not the aim of this work, and we do not explore it further here. Rather, we simply seek to outline some of the issues that may be of interest to practitioners and provide some guidance on reasonable values that we have found to work well empirically.
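The expansion procedure above can be sketched in a few lines. This is an illustrative reimplementation assuming the retrieved documents have already been fetched and tokenized; the helper names and the use of plain dicts for sparse vectors are our own, not part of the original system:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """TFIDF weights tf(t, d) * log(N / df(t)), per Step 3."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t, 0) > 0}

def truncate(vec, m=50):
    """Keep only the m highest-weighted terms (Step 4)."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)

def l2_normalize(vec):
    """Scale a sparse vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def query_expansion(docs, df, n_docs, m=50):
    """Centroid of the normalized, truncated TFIDF vectors, renormalized
    (Steps 3-6). `docs` holds token lists for the (at most) n retrieved
    documents; `df` maps terms to document frequencies in the corpus."""
    centroid = Counter()
    for tokens in docs:
        v = l2_normalize(truncate(tfidf_vector(tokens, df, n_docs), m))
        for t, w in v.items():
            centroid[t] += w / len(docs)
    return l2_normalize(centroid)
```

By construction, the returned expansion is a unit-length sparse vector, so inner products between expansions lie in [0, 1].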
Finally, given that we have a means for computing the query expansion for a short text, it is a simple matter to define the semantic kernel function as the inner product of the query expansions for two text snippets. More formally, given two short text snippets x and y, we define the semantic similarity kernel between them as:

K(x, y) = QE(x) . QE(y)
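With expansions represented as sparse unit-length vectors, the kernel itself reduces to a sparse dot product (a minimal sketch of the QE(x) . QE(y) definition):

```python
def kernel(qe_x, qe_y):
    """K(x, y) = QE(x) . QE(y): sparse inner product of two unit-length
    query expansions, iterating over the smaller vector for efficiency."""
    if len(qe_x) > len(qe_y):
        qe_x, qe_y = qe_y, qe_x
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())
```

Since both expansions are unit length, this is exactly the cosine between them, and hence a valid kernel.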
4 Initial Results With Kernel
To get a cursory evaluation for
how well our semantic similarity kernel performs, we show results
with the kernel on a number of text pairs, using the Google
search engine as the underlying document retrieval mechanism. We
attempt to highlight both the strengths and potential weaknesses
of this kernel function.

We examined several text snippet pairs to determine the similarity score given by our new web-based kernel, the traditional cosine measure, and the set overlap measure proposed by Fitzpatrick and Dent. We specifically looked at three genres of text snippet matching: (i) acronyms, (ii) individuals and their positions, and (iii) multi-faceted terms.^{3} Examples of applying the kernel are shown in Table 1, which is segmented by the genre of matching examined. The first section of the table deals with the identification of acronyms. In this genre, we find two notable effects using our kernel. First, from the relatively high similarity scores found between acronyms and their full names, it appears that our kernel is generally effective at capturing the semantic similarity between an acronym and its full name. Note that the kernel scores are not 1.0 since acronyms can often have multiple meanings. Related to this point, our second observation is that our kernel function (being based on contextual text usage on the web) tends to prefer more common usages of an acronym in determining semantic similarity. For example, the text ``AI'' is determined to be much more similar to ``artificial intelligence'' than to ``artificial insemination'' (even though it is a valid acronym for both), since contextual usage of ``AI'' on the web tends to favor the former meaning. We see a similar effect when comparing ``term frequency inverse document frequency'' to ``tf idf'' and ``tfidf''. While the former acronym tends to be more commonly used (especially since the sub-acronyms ``tf'' and ``idf'' are separated), the still reasonable score over 0.5 for the acronym ``tfidf'' shows that the kernel function is still able to determine a solid level of semantic similarity. It is not surprising that the use of cosine similarity is entirely inappropriate for such a task (since the full name of an acronym virtually never contains the acronym itself).
Moreover, we find, as expected, that the set overlap measure leads to very low (and not very robust) similarity values. Next, we examined the use of our kernel in identifying different characterizations of individuals. Specifically, we considered determining the similarity of the name of a notable individual with his prominent role description. The results of these examples are shown in the second section of Table 1. In order to assess the strengths and weaknesses of the kernel function, we intentionally applied the kernel to correct pairs of descriptions and individuals as well as to pairs involving an individual and a close, but incorrect, description. For example, while Kofi Annan and George W. Bush are both prominent world political figures, the kernel is effective at determining the correct role matches and assigning them appropriately high scores. In the realm of business figures, we find that the kernel is able to distinguish Steve Ballmer as the current CEO of Microsoft (and not Bill Gates). Bill Gates still gets a nontrivial semantic similarity with the role ``Microsoft CEO'' since he was indeed the former CEO, but he is much more strongly (by over a factor of 2) associated correctly with the text ``Microsoft founder''. Similarly, the kernel is successful at correctly distinguishing the current Google CEO (Eric Schmidt) from Larry Page (Google's founder and former CEO). We also attempted to test how readily the kernel function assigned high scores to inappropriate matches by trying to pair Bill Gates as the founder of Google and Larry Page as the founder of Microsoft. The low similarity scores given by the kernel show that it does indeed find little semantic similarity between these inappropriate pairs. Once again, the kernel value is nonzero since each of the individuals is indeed the founder of some company, so the texts compared are not entirely devoid of some semantic similarity.
Finally, we show that even though Larry Page has a very common surname, the kernel does a good job of not confusing him with a ``web page'' (although the cosine gives an inappropriately high similarity due to the match on the term ``page''). Lastly, we examined the efficacy of the kernel when applied to texts with multi-faceted terms, a case where we expect the raw cosine and set overlap to once again do quite poorly. As expected, the kernel does a reasonable job of determining the different facets of terms, such as identifying ``space exploration'' with ``NASA'' (even though they share no tokens), but finding that the similarity between ``vacation travel'' and ``space travel'' is indeed less than the cosine might otherwise lead us to believe. Similar effects are seen in looking at terms used in context, such as ``machine'', ``graphical'', and ``java''. We note that in many cases, the similarity values here are not as extreme as in the previous instances. This has to do with the fact that we are trying to measure the rather fuzzy notion of aboutness between semantic concepts rather than trying to identify an acronym or individual (which tend to be much more specific matches). Still, the kernel does a respectable job (in most cases) of providing a score above 0.5 when two concepts are very related and less than 0.3 when the concepts are generally thought of as distinct. Once again, the low similarity scores given by the set overlap method show that in the context of a large document collection such as the web, this measure is not very robust. As a side note, we also measured the set overlap using the top 500 and top 1000 documents retrieved for each query (in addition to the results reported here, which looked at the top 200 documents as suggested in the original paper), and found qualitatively very similar results, thus indicating that the method itself, and not merely the parameter settings, led to the poor results in the context of the web.
5 Theoretical Analysis of Kernel and Set Overlap Measures
In
light of the anecdotal results in comparing our kernel function
with the set overlap measure, it is useful to mathematically
analyze the behavior of each measure in the context of large (and
continuously growing) document collections such as the web. We
begin by introducing some relevant concepts for this
analysis.

Intuitively, this definition captures the notion that since a search engine generates a ranking of documents by scoring them according to various criteria, the scores used for ranking may only accurately resolve document relevance to within some toleration ε. This toleration factor reflects the inherent resolving limitation of a given relevance scoring function, and thus within this toleration factor, the ranking of documents can be seen as arbitrary.

As we are interested in analyzing very large corpora and the behavior of the various similarity measures in the limit as the collections being searched grow infinitely large, we consider the situation in which so many relevant documents are available to a search engine for any given query that the set of top-ranked documents are all indistinguishable. To formalize this concept, let D(S, q) be the set of all (maximally ranked) documents which are all indistinguishable to search engine S for query q. Now we note that as the size of the collection grows to infinity (i.e., N → ∞), then |D(S, q)| → ∞, since there will be infinitely many documents that are equally relevant to a given query. Moreover, since the documents in D(S, q) are indistinguishably relevant to q, we assume that the top n results retrieved for query q will be a uniformly random sampled subset of D(S, q) (with replacement, just to simplify the analysis in the limit as |D(S, q)| grows large). The use of a uniform distribution for sampling documents from D(S, q) can be justified by the fact that since all documents in D(S, q) are within the tolerance of the ranking function, their ranking is arbitrary. Since in this context there is no reason to prefer one particular distribution of rankings over any other, a maximally entropic distribution (i.e., uniform) is a reasonable model to use.

In the sequel, assume that we are given two different queries q_1 and q_2, which are so highly related to each other that (again, for simplicity) we assume D(S, q_1) = D(S, q_2).
While in reality it is unlikely that two queries would share exactly the same set of maximally relevant documents, we make this assumption (which intuitively should lead to a very high similarity score between q_1 and q_2) to show that even under conditions of extreme similarity, there are shortcomings with the set overlap similarity measure. We show that the kernel function does not suffer from similar problems. Since we assume D(S, q_1) = D(S, q_2) and always use the same search engine S in our analysis, we will simply refer to D(S, q_1) (and thus D(S, q_2)) as D for brevity when there is no possibility of ambiguity.
5.1 Properties of the Set Overlap Measure
A desirable (and straightforward) corollary of this theorem is that as we increase the results set size n to capture all the relevant documents (i.e., n → |D|), the expected overlap measure approaches 1. Interestingly, however, for any fixed results set size n, as |D| → ∞, the expected normalized set overlap approaches 0. This result suggests that even if two queries are so similar as to have the same set of highly relevant documents, in the limit as the collection size increases (and thus the number of relevant documents increases), the similarity as given by the set overlap measure will go to 0. Note that this problem would be even worse if we had not made the simplifying assumption that D(S, q_1) = D(S, q_2), as the set overlap measure would approach 0 even more quickly. While this result is not surprising, it does show the problem that arises with using such a measure in the context of a large collection such as the web. This is also borne out in the anecdotal results seen in Section 4.
5.2 Properties of the Kernel Function
Analyzing our kernel function under the same conditions as above, we find that the measure is much more robust to growth in collection size, making it much more amenable for use in broad contexts such as the web. Since the kernel function computes vectors based on the documents retrieved from the relevant set D, we examine properties of the document vectors from this set. Namely, we assume that the document vectors generated from the documents in D are distributed according to some arbitrary^{4} distribution with mean direction vector μ and a standard deviation σ, where σ measures the angular difference from μ. Such distributions, which are defined based on direction or angle, fall into the general class of circular distributions, and a full discussion of them is beyond the scope of this paper (we refer the interested reader to work on document analysis using circular distributions, such as the von Mises distribution [7,2]).

In this context, we note that QE(q) for a given query q is simply a sample mean from the distribution of document vectors in D. This follows from the fact that the n relevant documents retrieved in response to q are simply samples from D, and thus their corresponding document vectors are just samples from this distribution. The centroid C(q) of these vectors is defined to be the mean (direction) of the vectors, and QE(q) is just the unit length normalized centroid (with the same direction as C(q)), thus making it a sample mean of the vector directions in D.

Note that the bound in Theorem 2 is independent of |D|, even though it depends on σ. This follows from the fact that since the vectors that correspond to documents in D are just samples from some true underlying stationary distribution (with mean direction μ and standard deviation σ), the true standard deviation does not change as |D| → ∞. Since σ is independent of |D|, then so is the bound.
This implies that the kernel function is robust for use in large collections, as its value does not depend on the number of relevant documents, but simply on the directional dispersion (measured by the standard deviation σ over angles) of the vectors of the relevant documents. This property makes the kernel well-suited for use with large collections such as the web. Furthermore, we can consider the more general (and realistic) case where the sets of indistinguishable results for queries q_1 and q_2 need not be the same (i.e., D(S, q_1) ≠ D(S, q_2)), and now prove a more general result that subsumes Theorem 2 as a special case.
We note that Theorem 2 is simply a special case of Theorem 3 in which the two queries share the same set of relevant documents, and thus the same mean direction vector; Theorem 2 was derived separately just to provide a direct contrast with the normalized set overlap measure under the same conditions. Intuitively, Theorem 3 tells us that the kernel function is essentially trying to measure the cosine of the angle between the mean vectors of the documents relevant to each query, with a ``noise'' term that depends on the natural dispersion (standard deviation) of the documents relevant to each query and the size n of the sample used to generate the query expansion. Thus, if we were to think of the set of documents that are relevant to a given query q as the ``topic'' of q, then the kernel is attempting to measure the mean ``topical'' difference between the queries, independent of the number of documents that make up each topic. This sort of behavior (and its independence from the overall collection size) is an intuitively desirable property for a similarity function.
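The contrasting stability of the kernel can likewise be simulated. Below, each "document vector" is a noisy unit vector around a fixed mean direction, standing in for a sample from the circular distribution above (a toy Gaussian-noise model of our own choosing, not a true von Mises distribution); the dot product of two sample means stays close to 1 no matter how many relevant documents exist, depending only on the sample size n and the dispersion:

```python
import math
import random

def sample_mean_direction(mu, sigma, n, rng):
    """Unit-normalized mean of n noisy unit vectors around direction mu,
    a stand-in for QE computed from n sampled relevant documents."""
    k = len(mu)
    acc = [0.0] * k
    for _ in range(n):
        v = [m + rng.gauss(0, sigma) for m in mu]
        norm = math.sqrt(sum(x * x for x in v))
        for i in range(k):
            acc[i] += v[i] / norm
    norm = math.sqrt(sum(x * x for x in acc))
    return [x / norm for x in acc]

rng = random.Random(1)
mu = [1.0] + [0.0] * 9  # shared mean direction for both queries
for n in (10, 100):
    a = sample_mean_direction(mu, 0.3, n, rng)
    b = sample_mean_direction(mu, 0.3, n, rng)
    print(n, round(sum(x * y for x, y in zip(a, b)), 3))
```

Unlike the overlap simulation, no parameter here plays the role of |D|: the kernel estimate is governed entirely by n and the dispersion, matching the theorem's independence from collection size.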
6 Related Query Suggestion
Armed with promising anecdotal
evidence as well as theoretical results that argue in favor of
using this kernel when comparing short texts, we turn our
attention to the task of developing a simple application based on
this kernel. The application we choose is query suggestion: that
is, to suggest potentially related queries to the users of a
search engine to give them additional options for information
finding. We note that there is a long history of work in query
refinement, including the previously mentioned work in query
expansion [3,13], harnessing
relevance feedback for query modification [10], using precomputed
term similarities for suggestions [19], linguistically
mining documents retrieved in response to a search for related
terms and phrases [20,1], and even simply
finding related queries in a thesaurus. While this is certainly
an active area of work in information retrieval, we note that
improving query suggestion is not the primary focus of this work.
Thus, we intentionally do not compare our system with others.
Rather, we use query suggestion as a means of showing the
potential utility of our kernel function in just one, of
potentially many, realworld applications. We provide a user
evaluation of the results in this application to get a more
objective measure of the efficacy of our kernel. At a high level,
our query expansion system can be described as starting with an
initial repository Q of previously
issued user queries (for example, culled from search engine
logs). Now, for any newly issued user query q, we can compute our kernel function K(q, q_i) for all queries q_i in the repository and suggest the related queries which have the highest kernel score with q
(subject to some
post-filtering to eliminate related queries that are too
linguistically similar to each other). More specifically, we
begin by precomputing the query expansions for a repository
of approximately 116 million
popular user queries issued in 2003, determined by sampling
anonymized web search logs from the Google search engine. After
generating these query expansions, we index the resulting vectors
for fast retrieval in a retrieval system R. Now, for any newly observed user query q, we can generate its query expansion QE(q) and use this entire expansion as a disjunctive
query to R, finding all existing
query expansions QE(q_i) in the
repository that potentially match QE(q). Note that if a query expansion QE(q_i) indexed in R does
not match QE(q) in at least one term
(i.e., it is not retrieved), then we know K(q, q_i) = 0, since there are no common terms in QE(q) and QE(q_i). For each retrieved query expansion QE(q_i), we can then compute
the inner product
QE(q) . QE(q_i) = K(q, q_i).

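The disjunctive matching step can be sketched with a simple inverted index over the precomputed expansions (illustrative data structures of our own; the production retrieval system is not described at this level of detail here):

```python
from collections import defaultdict

def build_index(expansions):
    """Inverted index mapping term -> [(query_id, weight)] over the
    precomputed query expansions in the repository."""
    index = defaultdict(list)
    for qid, qe in expansions.items():
        for term, w in qe.items():
            index[term].append((qid, w))
    return index

def match(index, qe_u):
    """Disjunctive retrieval: accumulate K(u, q_i) = QE(u) . QE(q_i) for
    every indexed expansion sharing at least one term with QE(u).
    Expansions sharing no terms are never touched; their kernel is 0."""
    scores = defaultdict(float)
    for term, w in qe_u.items():
        for qid, wi in index.get(term, []):
            scores[qid] += w * wi
    return dict(scores)
```

This realizes the observation above: work is proportional to the number of shared terms, and non-matching repository queries cost nothing.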
To actually determine which of the matched queries from the repository to suggest to the user, we use the following algorithm, where the constant MAX is set to the maximum number of suggestions that we would like to obtain:

Given: user query q, and
       list of matched queries q_1, q_2, ..., q_n from repository U
Output: list S of queries to suggest
1. Initialize suggestion list S = {}
2. Sort the kernel scores K(q, q_i) in descending order
   to produce an ordered list
   q_(1), q_(2), ..., q_(n) of corresponding queries.
3. i = 1
4. While i <= n and |S| < MAX do
   4.1 If |q_(i) ∩ x| < |q_(i)| / 2 for all x in S ∪ {q} then
       4.1.1 S = S ∪ {q_(i)}
   4.2 i = i + 1
5. Return suggestion list S

Here |q| denotes the number of terms in query q, and q_(i) ∩ x denotes the set of terms common to q_(i) and x. Thus, the test in Step 4.1 above is our post-filter: it adds another suggested query only if that query differs in more than half of its terms from every query already in the suggestion list (as well as from the original user query q). This helps promote linguistic diversity in the set of suggested queries. The resulting list of query suggestions can be presented to the search engine user to guide them in conducting follow-up searches.
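The algorithm above can be written compactly in Python. This is a minimal sketch, assuming queries are whitespace-tokenized strings and reading the Step 4.1 diversity test as "shares fewer than half of its terms with every already-accepted query"; the exact production filter may differ:

```python
def suggest(query, scored_candidates, max_suggestions=5):
    """Select a diverse list of suggestions from kernel-scored candidates.

    query             -- the original user query (string)
    scored_candidates -- list of (candidate_query, kernel_score) pairs
    max_suggestions   -- the MAX constant from the paper's algorithm
    """
    ranked = sorted(scored_candidates, key=lambda cs: -cs[1])   # Step 2
    suggestions = []                                            # Step 1
    for candidate, score in ranked:                             # Steps 3-4
        if len(suggestions) >= max_suggestions:
            break
        cand_terms = set(candidate.split())
        # Step 4.1: post-filter for linguistic diversity -- accept only if
        # the candidate shares fewer than half its terms with every query
        # already in the suggestion list and with the original query.
        diverse = all(
            len(cand_terms & set(other.split())) < len(cand_terms) / 2.0
            for other in suggestions + [query]
        )
        if diverse:
            suggestions.append(candidate)                       # Step 4.1.1
    return suggestions                                          # Step 5
```

For example, for the query "ai", a near-duplicate of an already-accepted suggestion such as "artificial intelligence research" (two of three terms shared with "artificial intelligence") is filtered out.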
7 Evaluation of Query Suggestion System
In order to evaluate
our kernel within the context of this query suggestion system, we
enlisted nine human raters who are computer scientists familiar
with information retrieval technologies. Each rater was asked to
issue queries from the Google Zeitgeist^{5}, which tracks popular
queries on the web monthly, for a different month of 2003 (since
our initial repository of queries to suggest was culled near the
start of 2003). We chose to use such common
queries for evaluation because if useful suggestions were found,
they could potentially be applicable for a large number of search
engine users who had the same information needs. Each rater
evaluated the suggested queries provided by the system on a
5-point Likert scale, defined as:
1: suggestion is totally off topic.
2: suggestion is not as good as original query.
3: suggestion is basically same as original query.
4: suggestion is potentially better than original query.
5: suggestion is fantastic; we should suggest this query,
   since it might help a user find what they're looking
   for if they issued it instead of the original query.
In our experiment we set the maximum number of suggestions for each query (MAX) to 5, although some queries yielded fewer suggestions because fewer passed the post-filtering process. A total of 118 user queries, which yielded 379 suggested queries (an average of 3.2 suggestions per query), were rated. Note that some raters evaluated a different number of queries than others.

In Table 2 we provide an example of two user queries, the query suggestions made using our system, the corresponding kernel scores, and the human evaluation ratings for the suggested queries. As can be seen in the first example, it is not surprising that users interested in the ``california lottery'' would prefer to find winning numbers rather than simply trying to get more information on the lottery in general. In the second example, we find that users querying for ``valentines day'' may be looking to actually send greeting cards. The suggestion ``new valentine one'' actually refers to a radar detector named Valentine One, and is thus clearly off-topic with regard to the original user query.

Since each query suggestion has a kernel score associated with it, we can determine how suggestion quality is correlated with the kernel score by looking at the average rating over all suggestions whose kernel score exceeds a given threshold. If the kernel is effective, we would generally expect higher kernel scores to lead to more useful suggested queries (as they would tend to be more on-topic, even given the post-filtering mechanism that attempts to promote diversity among the query suggestions). Moreover, we would expect the suggestions overall to often be rated close to 3 (or higher) if the kernel were effective at identifying query suggestions semantically similar to the original query.
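The threshold analysis just described is straightforward to compute. A small sketch, where `scored_ratings` is a hypothetical list of (kernel score, rater score) pairs standing in for the 379 rated suggestions:

```python
def ratings_by_threshold(scored_ratings, thresholds):
    """For each kernel-score threshold, average the human ratings of all
    suggestions whose kernel score meets or exceeds that threshold.

    scored_ratings -- list of (kernel_score, likert_rating) pairs
    thresholds     -- iterable of kernel-score cutoffs to evaluate
    Returns a dict mapping threshold -> average rating (None if no
    suggestion scores at or above the threshold).
    """
    table = {}
    for t in thresholds:
        kept = [rating for score, rating in scored_ratings if score >= t]
        table[t] = sum(kept) / len(kept) if kept else None
    return table
```

If the kernel is effective, the averages in this table should rise (toward 3 and above) as the threshold increases.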
8 Conclusions and Future Work
We have presented a new kernel
function for measuring the semantic similarity between pairs of
short text snippets. We have shown, both anecdotally and in a
human-evaluated query suggestion system, that this kernel is an
effective measure of similarity for short texts, and that it works
well even when the short texts being considered have no terms in
common. Moreover, we have also provided a theoretical analysis of
the kernel function showing that it is well suited for use with
the web. There are several lines of future work for which this
kernel lays the foundation. The first is improvement in the
generation of query expansions with the goal of improving the
match score for the kernel function. The second is the
incorporation of this kernel into other kernel-based machine
learning methods to determine its ability to provide improvement
in tasks such as classification and clustering of text. Also,
there are certainly other potential web-based applications,
besides query suggestion, that could be considered as well. One
such application is in a question answering system, where the
question could be matched against a list of candidate answers to
determine which is the most similar semantically. For example,
using our kernel we find that: K(``Who shot Abraham Lincoln'',
``John Wilkes Booth'') = 0.730. Thus, the kernel does well in
giving a high score to the correct answer to the question, even
though it shares no terms in common with the question.
Alternatively, K(``Who shot Abraham Lincoln'', ``Abraham
Lincoln'') = 0.597, indicating that while the question is
certainly semantically related to ``Abraham Lincoln'', the true
answer to the question is in fact more semantically related to
the question. Finally, we note that this kernel is not limited to
use on the web, and can also be computed using query expansions
generated over domain-specific corpora in order to better capture
contextual semantics in particular domains. We hope to explore
such research avenues in the future.
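The question-answering application sketched above reduces to a one-line ranker once a kernel is available. In this sketch, `kernel` is a placeholder for any K(x, y), such as the web-based kernel of this paper or a domain-specific variant:

```python
def best_answer(question, candidates, kernel):
    """Return the candidate answer with the highest kernel similarity
    to the question. `kernel` is any function K(x, y) -> float, e.g. an
    expansion-based similarity kernel (supplied by the caller)."""
    return max(candidates, key=lambda cand: kernel(question, cand))
```

With the kernel values reported in the text (K = 0.730 for the correct answer versus 0.597 for the topically related entity), this ranker selects ``John Wilkes Booth'' for the question ``Who shot Abraham Lincoln''.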
Acknowledgments
We thank Amit Singhal for many invaluable discussions related to this research. Additionally, we appreciate the feedback provided on this work by the members of the Google Research group, especially Vibhu Mittal, Jay Ponte, and Yoram Singer. We are also indebted to the nine human raters who took part in the query suggestion evaluation.

Bibliography
[1] P. Anick and S. Tipirneni. The paraphrase search assistant: Terminological feedback for iterative information seeking. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153-159, 1999.
[2] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345-1382, 2005.
[3] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In The Third Text REtrieval Conference, pages 69-80, 1994.
[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[5] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. Journal of Intelligent Information Systems, 18(2):127-152, 2002.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] I. S. Dhillon and S. Sra. Modeling data using directional distributions, 2003.
[8] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM-98: Proceedings of the Seventh International Conference on Information and Knowledge Management, 1998.
[9] L. Fitzpatrick and M. Dent. Automatic feedback using past queries: Social searching? In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 306-313, 1997.
[10] D. Harman. Relevance feedback and other query modification techniques. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 241-263. Prentice Hall, 1992.
[11] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, 1998.
[12] J. S. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Advances in Neural Information Processing Systems (NIPS) 15, pages 657-664, 2002.
[13] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[14] V. V. Raghavan and H. Sever. On the reuse of past optimal queries. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 344-350, 1995.
[15] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[16] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[17] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613-620, 1975.
[18] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems (NIPS) 15, pages 1473-1480, 2002.
[19] B. Vélez, R. Weiss, M. A. Sheldon, and D. K. Gifford. Fast and effective query refinement. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 6-15, 1997.
[20] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-11, 1996.
Footnotes

^{1} Leading search engines claim index sizes of at least 20 billion documents at the time of this writing.

^{2} While the real focus of our work is geared toward short text snippets, there is no technical reason why a snippet must have limited length; in fact, it can be arbitrary text.

^{3} We prefer the term multi-faceted over ambiguous, since multi-faceted terms may have the same definition in two contexts, but the accepted semantics of that definition may vary in context. For example, the term ``travel'' has the same definition in both the phrases ``space travel'' and ``vacation travel'', so it is (strictly speaking) not ambiguous here, but the semantics of what is meant by traveling in those two cases is different.

^{4} The distribution is arbitrary up to the fact that its first two moments, mean and variance, exist (which is a fairly standard and non-restrictive assumption).

^{5} http://www.google.com/intl/en/press/zeitgeist.html