Answering Bounded Continuous Search Queries in the World Wide Web

Dirk Kukulenz

Institute of Information Systems
Ratzeburger Allee 160, 23538 Luebeck, Germany

Alexandros Ntoulas

Microsoft Search Labs
1065 La Avenida, Mountain View, CA 94043, USA
(Work performed while author was at UCLA Computer Science Department.)

ABSTRACT

Search queries applied to extract relevant information from the World Wide Web over a period of time may be denoted as continuous search queries. The improvement of continuous search queries concerns not only the quality of retrieved results but also the freshness of results, i.e. the time between the availability of a respective data object on the Web and the notification of the user by the search engine. In some cases a user should be notified immediately, since the value of the respective information decreases quickly, e.g. news about companies that affects the value of their stocks, or sales offers for products that may no longer be available after a short time.
In the document filtering literature, the optimization of such queries is usually based on threshold classification: documents above a quality threshold are returned to a user, and the threshold is tuned in order to optimize the quality of the retrieved results. The disadvantage of such approaches is that the amount of information returned to a user can hardly be controlled without further user interaction. In this paper, we consider the optimization of bounded continuous search queries where only the estimated best k elements are returned to a user. We present a new optimization method for bounded continuous search queries based on optimal stopping theory and compare the new method to methods currently applied by Web search systems. The new method provides results of significantly higher quality in the cases where very fresh results have to be delivered.

Categories & Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Search process, Selection process, Information filtering; H.3.7 [Information Storage and Retrieval]: Digital Libraries - Dissemination

General Terms

Algorithms, Experimentation

Keywords

monitoring search, continuous queries, optimal stopping

1 Introduction

The Web is a constantly updating source of information. A large number of the latest developments, events and commentaries are continuously being posted on the Web in the form of blogs, news articles, Usenet posts, etc. In order to spare users the tedious task of periodically searching for updated information on the Web, some search engines provide an ``alert'' service (e.g. Google Alerts [16] or Live Alerts [26]). The main idea is to let users create profiles (e.g. by specifying a few query keywords) describing the information for which they would like to receive updates. After that, the search engine continuously monitors the newly collected information from the Web and alerts the user (e.g. through email) whenever there is new information that matches the user's profile. In this way, the user can stay current with a developing news story, track a good deal on a desired product or follow who is writing about her blog.

The continuous monitoring of the Web to match a given set of user-defined keywords is also known as continuous querying. Typically, the results of a continuous query can be returned to the user either on an ``as-found'' basis or ``in-batch'' after a particular time interval (e.g. once a day).

Although running continuous queries on the Web can potentially help the users to stay current with important updates, in general, the amount of information returned as updates to the user can be ``unbounded''. For example, if the user is following a very controversial or popular topic, she may receive hundreds of updated pages as an alert, and may thus be overwhelmed by this huge amount of information.

Since a user typically can allot only a limited time for comprehending the information delivered to her, one way to alleviate this problem is to allow the user to restrict (or bound) the number of returned results within a particular time interval. More specifically, the user may decide that, say, every day, she is interested in reading only the 10 most relevant updates to her continuous query and would like to receive only those updates. For the cases where it is acceptable to return the results in-batch, the solution is straightforward: we first collect all the relevant results within a day, rank them and then return the top-10 to the user. However, in the cases where the ``freshness'' of the results is very important, and thus we need to return them as-found, the user is not willing to wait until we collect all the relevant results and return them at the end of the query period. For example, if a user is tracking Web pages describing digital cameras offered for sale, she would like to know the 10 best pages according to some specification as soon as they appear, since the cameras may be sold after a short period of time.

Returning the $k=10$ best results on an as-found basis for a given continuous query involves two main challenges. First, the potentially relevant results within a time period (e.g. a day) are not known in advance. Without knowing all the relevant results, how can we find the top-$k$ among them to return to the user? Second, the points in time at which the top-$k$ relevant results will appear are also not known in advance. Given that the freshness of results is very important, how can we ensure that we return the top-$k$ results as soon as they appear?

Regarding the first challenge, clearly we would have to wait until we have seen all results in order to calculate the exact top-$k$. However, in a practical scenario, we may safely assume that the user is willing to exchange some imprecision in the top-$k$ results for greater freshness. For example, in our digital camera scenario, the user may be happy to receive 10 deals of which only 9 belong to the true top-10, but receive them soon enough to actually be able to buy the products.

Given this relevance/freshness tradeoff, in this paper we present an optimization method for bounded continuous search queries on the Web. More specifically, our goal is to extract the top-$k$ relevant pages that appear over a specified time interval on the Web and return them to the user as soon as possible. Our proposed solution utilizes principles from the field of optimal stopping [34] in order to deliver fresh, high-quality and bounded-in-number search results during the specified execution time of a continuous query. Optimal stopping is a well-known problem in the field of financial mathematics. The main idea of this paper is to view the development of the relevance of versions of Web pages as relevance charts and to treat the problem of estimating top-$k$ results as-found similarly to a basic problem in financial mathematics, namely the problem of finding buying or selling points for stocks at optimal prices.

In summary, our contributions in this paper are:

In the next section we start our discussion by presenting the strategy that is presently employed by current search engines. In Section 2.2 we demonstrate that, in cases where we need to obtain information that is as up-to-date as possible, current approaches may return sub-optimal results to the users. In Section 3 we define a new language for bounded continuous search queries and present our optimization approach, which utilizes principles from the field of optimal stopping. Finally, in the experimental Section 4 we verify that our new method can generate fresher and more relevant results for a variety of continuous queries. We conclude with an overview of related research and a summary.

2 Background


2.1 Periodic Evaluation Method

In this section we give basic definitions and present a common strategy to process bounded continuous search queries that is applied by Web search systems and used as a reference strategy in this paper.
In this work we consider a simple stream of documents or versions of a specific document $\{ d_{t_1}, d_{t_2}, \ldots, d_{t_n} \}$. The index $t_j$ corresponds to the time a document is obtained from the Web in a push or a pull manner [21].
\begin{definition}
A {\it bounded (document) filtering method} is a method that selects documents from a document stream based on a bounding condition, a query profile and a distance function
between user profiles and documents.
\end{definition}
We consider bounding conditions that are specified by the maximal number of documents to be returned. A bounding condition provided by a user corresponds to the maximal information load the user is willing to accept with respect to a query. It is obvious that the threshold-based information filtering methods presented in the field of topic detection and tracking [2] are not bounded. We consider query profiles that are determined by a set of query terms provided by a user. We may thereby assume that query profiles, similar to documents, may be expressed in a term vector space. Well-known methods from Information Retrieval may therefore be applied to compute a distance between a query profile and a document.
Based on the tf-idf measure [32] we may apply the function
\begin{displaymath}
R_{j,Q}=\sum_{\texttt{term:}\;w_k \in Q} \left(0.5+0.5\,\frac{tf_{j,k}}{tf_{j,max}}\right) idf_k
\end{displaymath} (1)

to compute the distance between a document $d_j$ and a query profile (i.e. a set of query terms) $Q$ [39]. In the formula, the term $tf_{j,k}$ (term frequency) is the frequency of the term $w_k$ in the document $d_j$ and $tf_{j,max}$ is the maximal term frequency in $d_j$. $N$ is the number of documents in the collection and $idf_k:=\log(N/df_k)$ is the inverse document frequency, where $df_k$ is the number of documents in the collection that contain $w_k$. Formula (1) is one way to estimate the relevance of a document with respect to a query. In this paper, in order to keep the discussion concise, whenever we talk about relevance we use Equation (1), but in general any other estimator could be used.

A problem with this measure is the computation of the $df_k$ term: at a specific point in time the entire set of documents appearing in the stream is not yet available. Possible approaches to this problem have been presented in [38], [18] and [31]. In this work we consider each version of the information source, i.e. each incoming document, as a single-document 'collection'. The $df_k$ term is then based only on the state of the information source at the current point in time. If $\theta_{t_i}$ is the state of the information source $\theta$ at time $t_i$, we denote the respective ranking of a document $d_j\in \theta_{t_i}$ with respect to the query $Q$ as $R^{\theta_{t_i}}_{j,Q}$. As described above, in this work we make the simplifying assumption that at each point in time a single (new) document is available. We may therefore define $d_j:=\theta_{t_i}$ and $R_{i,Q}:=R^{\theta_{t_i}}_{j,Q}$.
The function $R_{i,Q}$ is used to obtain relevance estimations for new documents in the sequence with respect to a query profile. In this work we simplify the search for optimal documents by defining quality as the estimation provided by the ranking function $R_{i,Q}$. Quality is used as a synonym for the relevance of documents.
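To make Equation (1) concrete, the following is a minimal Python sketch of the ranking computation. The tokenization, the way the document-frequency statistics are supplied and all identifiers are our own illustrative assumptions, not part of the original system.

import math
from collections import Counter

def rank(document, query_terms, collection):
    """Relevance R_{j,Q} of `document` w.r.t. the query terms, following Eq. (1).

    `collection` supplies the document frequencies df_k; how this collection is
    chosen (e.g. only the current state of the source) follows the discussion above.
    """
    tokens = document.lower().split()
    tf = Counter(tokens)
    tf_max = max(tf.values()) if tf else 1            # tf_{j,max}
    n_docs = len(collection)                          # N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in collection if term in d.lower().split())   # df_k
        if df == 0:
            continue                                  # term never observed, idf undefined
        idf = math.log(n_docs / df)                   # idf_k = log(N / df_k)
        score += (0.5 + 0.5 * tf[term] / tf_max) * idf
    return score

For example, rank("camera 12 mega flash offer", ["camera", "flash"], collection) returns a higher value for documents that contain the query terms frequently and for terms that are rare in the supplied collection.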
\begin{definition}
The {\it quality} of a document $d_i$ with respect to a query profile $Q$
is defined by the ranking function $R_{i,Q}$.
\end{definition}
In order to find optimal documents we therefore simply have to find the documents with the highest ranking values according to $R_{i,Q}$. This quality definition is used because the development of estimation functions similar to (1) is not the focus of this paper; it has been examined thoroughly in the field of information retrieval. In this paper we show that even if an optimal estimator for the quality of documents is given (or assumed), the optimization of bounded continuous search queries is not a trivial problem.
Based on the previous definitions we may now define a common strategy for bounded document filtering.
\begin{definition}
A {\it periodic evaluation (PE(n))} method is a selection strategy for bounded document filtering that divides the query execution time into $n$ evaluation periods. At the end of each evaluation period a number of the best documents observed in that period is returned according to
the bounding condition.
\end{definition}

In this work we consider a PE method that applies function (1) as the ranking or evaluation function. A PE method is obviously a bounded filtering method according to Definition 1 because of the bounding condition, which may e.g. specify a maximal number of pages returned in each evaluation period or a total maximal number. In the latter case the number of pages returned per evaluation period is (the closest integer to) the total number of documents to be returned according to the bounding condition divided by $n$, the number of evaluation periods. A PE query may, e.g., ask for the 'best 10' pages each day with respect to a set of query terms, as realized by the Google Alerts system [16]. The PE method is illustrated in Figure 1. In the figure, '$\times$' symbols denote ranking values that depend on the current state of a specific data source or on a new document in a stream, a query profile and a ranking function (e.g. function (1)). In this case the query execution time is $qet=e_2-e_0$ and there are two evaluation periods. The bounding condition is 4, i.e. the best two documents in each evaluation period have to be selected and forwarded to a user, as indicated by the circled ranking values.
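The PE(n) strategy can be stated compactly in code. The sketch below is our own illustration and assumes that the ranking values and their observation times are already available as lists; the function name and signature are hypothetical.

def periodic_evaluation(rankings, times, n_periods, k_total, start, end):
    """PE(n): split [start, end] into n evaluation periods and, at the end of each
    period, deliver the best k_total/n documents observed in that period."""
    per_period = round(k_total / n_periods)          # bounding condition per period
    period_len = (end - start) / n_periods
    delivered = []                                   # (delivery time, document index)
    for p in range(1, n_periods + 1):
        p_start = start + (p - 1) * period_len
        p_end = start + p * period_len
        last = (p == n_periods)
        in_period = [i for i, t in enumerate(times)
                     if p_start <= t < p_end or (last and t == end)]
        best = sorted(in_period, key=lambda i: rankings[i], reverse=True)[:per_period]
        delivered.extend((p_end, i) for i in best)   # results are forwarded at e_p
    return delivered

With the setting of Figure 1 (n=2 evaluation periods, bounding condition 4) each period contributes its two best documents, delivered at $e_1$ and $e_2$ respectively.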


2.2 The freshness/quality tradeoff

In this section we demonstrate cases where the PE strategy is sub-optimal and thereby illustrate the tradeoff between the freshness of information and the quality of retrieved results.
It is obvious from Figure 1 that by applying the PE strategy documents are returned with a certain delay between the point in time when a document is obtained and the time at the end of an evaluation period when the document or a respective notification is forwarded to a user. We may therefore define a freshness (or, reciprocally, delay) metric as:
\begin{displaymath}
delay:= \frac{1}{k\,(e_{\sharp e}-e_0)}\sum_{n=1}^{\sharp e}\;\sum_{j=1}^{a^{n}}\left(e_n-d^{n}_j\right)
\end{displaymath} (2)

where $e_1, e_2, \ldots$ are the end points of the evaluation periods (Figure 1). At these points in time results are sent to a user. $\sharp e$ is the number of evaluation periods and $k$ is the number of requested (estimated best) objects. $a^{n}$ is the number of optimal elements to be selected in the $n$-th evaluation period (with $\sum^{\sharp e}_{n=1}a^n=k$) and $d^{n}_j$ is the appearance time of the $j$-th best element selected in the $n$-th evaluation period.
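For illustration, the delay measure can be computed directly from the period end points and the appearance times of the selected elements; this small sketch and its argument names are our own.

def delay(period_ends, selected_times, e0, k):
    """Normalized average delay as defined in Equation (2).

    period_ends    : [e_1, ..., e_#e]
    selected_times : selected_times[n] lists the appearance times d^n_j of the
                     elements selected in the (n+1)-th evaluation period
    """
    total = sum(e_n - d
                for e_n, picks in zip(period_ends, selected_times)
                for d in picks)
    return total / (k * (period_ends[-1] - e0))

# Example: two periods ending at e_1=12 and e_2=24, two selections per period.
# delay([12, 24], [[3, 10], [15, 20]], e0=0, k=4) == (9 + 2 + 9 + 4) / 96 == 0.25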
It may now be shown that a PE-method is not optimal if a high freshness is required.

Theorem 1   The PE-method may choose sub-optimal documents if a high freshness value is required, i.e. if $delay \rightarrow 0$.


The validity of this theorem may be demonstrated by the example in Figure 1. If the best documents that appear during the query execution time $[e_0, e_2]$ have to be selected, the optimal strategy is to store documents and to wait until $e_2$. At this point the 4 highest-ranked documents may be returned, assuming the bounding condition requires 4 documents to be returned. However, as the example in Figure 1 shows, the delay value as defined above may then be significant. In order to acquire fresher results, a larger number of evaluation periods has to be used. In Figure 1 the query period is subdivided into 2 evaluation periods. The bounding condition in this case is 2 documents for each of the two evaluation periods in order to fulfill the global bounding condition of at most 4 documents. The freshness of the retrieved documents is obviously increased. However, the selected documents (illustrated by circled '$\times$' symbols) are no longer the optimal ones and represent a suboptimal choice.


Figure 1: The PE(n) strategy: A number of best items according to the bounding condition is returned to a user after each of the $n$ evaluation periods.


The reason for this decrease of retrieval quality is the missing knowledge about future document rankings when objects are evaluated at an earlier point in time. This is an intrinsic problem of the optimization of bounded continuous search queries. No method has information about future data objects, and therefore every conceivable method is subject to this problem, which we denote as the freshness/quality tradeoff.
\begin{lemma}
If the freshness of the information returned by a bounded filtering method increases, this implies a decrease of the quality of the retrieval results (and vice versa).
\end{lemma}
It has to be noted that this tradeoff does not arise for threshold-based filtering methods. In the example in Figure 1 we would then not have the restriction of a maximal number of objects to be returned and could forward every object above a specified threshold. However, in this case, not knowing future ranking values, a suboptimal threshold may be chosen, which affects precision and recall.


3 A query method for bounded continuous search (BCS) queries


3.1 The query language 'BCSQL'

In this section we describe the main syntax of a new query language to state bounded continuous search (BCS) queries and in subsequent sections we describe how these queries are answered within our prototype system.
At a high level, we employ a query model similar to the OpenCQ language [23]. In OpenCQ a continuous query is a triple $(Q, T_{c}, Stop)$ consisting of a normal query $Q$ (e.g. written in SQL), a trigger condition $T_{c}$ and a termination criterion $Stop$. In this work we consider only time-based trigger conditions. We extend the basic notation of OpenCQ in order to support continuous search queries. For this purpose we assume the availability of a ranking function for query results, as provided by (1). A main extension with respect to many continuous query languages is the possibility to provide a bounding condition: in the considered query language a user has to define the number of estimated best results to be returned. This feature is well known from common search engines, where the best $n$ results are displayed on the first result page. A further specific attribute is the requirement to specify a user profile consisting of query terms. An example of the considered query language is the following:
CREATE BCSQ:
SalesWatch as
Query:
SELECT ESTIMATED$^{method}$ BEST 10
FROM SERVER www.ebay.com
WHERE query='camera 12 mega flash'
Trigger:
60 minutes
Start:
now
Stop:
7 days
Delay:
0 minutes
In this query the user requests the best documents on the server www.ebay.com over a period of 7 days with respect to the query terms 'camera 12 mega flash'. The trigger condition in this query language is used to define the incoming document stream in a pull-based manner; in this example the data source is reloaded every hour. Since the user in this example wants to buy the respective camera, she is interested in an immediate notification if relevant pages appear. The delay parameter (Delay=0) indicates that results should be delivered immediately after detection on the Web. By the 'BEST 10' directive she may limit the number of irrelevant pages returned by the query engine. The 'ESTIMATED$^{method}$ BEST' directive in the query denotes that, given an appropriate ranking measure and estimation method, the query engine should estimate and return the best documents. In general it is not known whether versions of a data source that appear in the future will have a higher ranking and thereby declassify the current version as (relatively) irrelevant. The current version, if returned to a user, would then create 'costs' in terms of information overload and decreased 'precision'.

In the example the current version may contain the terms 'camera 12 mega' but a future version may contain the terms 'camera 12 mega' and 'flash' which declassifies the current version. However if the query engine waits until all versions have been available, the respective cameras may already be sold.

In the following we refer to this query language as bounded continuous search query language (BCSQL).
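As an illustration of how a BCSQL query might be represented inside a query engine, the following sketch uses a plain Python data class; all field names are hypothetical and not prescribed by the language definition above.

from dataclasses import dataclass

@dataclass
class BCSQuery:
    """Hypothetical in-memory representation of a BCSQL query."""
    name: str                # CREATE BCSQ: e.g. "SalesWatch"
    server: str              # FROM SERVER
    terms: list              # WHERE query='...'
    k: int                   # ESTIMATED BEST k
    trigger_minutes: int     # Trigger: reload interval (pull-based)
    start: str               # Start: e.g. "now"
    stop_days: int           # Stop: query execution time
    delay_minutes: int = 0   # Delay: maximal acceptable notification delay

sales_watch = BCSQuery("SalesWatch", "www.ebay.com",
                       ["camera", "12", "mega", "flash"],
                       k=10, trigger_minutes=60, start="now",
                       stop_days=7, delay_minutes=0)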


3.2 Answering queries: selecting the best k

In this section we give an introduction to the considered optimal stopping problem, frequently denoted as the 'Secretary Selection Problem' (SSP). We first summarize results from the literature that form the basis of the optimization method presented in this paper.
In the classical SSP a sequence of ranked objects is presented to a 'player' whose task is to choose the best object. The choice may be based only on previous observations. The ranking values of the objects are assumed to be distinct and to arrive in random order, all orderings being equally likely. An object has to be chosen immediately when presented to the player and may not be chosen later. This basic problem has been analyzed e.g. in [14] and [12]. A well-known strategy for this problem is to observe a number of candidates without choosing any of them; the ranking values of these candidates are stored. After this observation period the first subsequent candidate is chosen that has a higher ranking than the maximal ranking value seen in the observation period. The main problem then is to find an optimal length of the observation period. In order to maximize the probability of finding the best candidate, the optimal strategy is to choose an observation period of length $\frac{n}{e}$, where $n$ is the number of candidates and $e\approx 2.718$ is Euler's number. In other words, roughly the first 37% of the candidates (slightly more than a third) should be observed without being chosen. This result has been proved in the limit $n\rightarrow \infty$. Further strategies for the basic SSP are discussed in [19]. Extensions of the basic SSP have been proposed in [14] and [30].
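The n/e rule is easy to verify empirically. The following Monte-Carlo sketch is our own illustration (not part of the original analysis): it skips the first n/e candidates and then accepts the first candidate that beats the best one observed.

import math
import random

def secretary_pick(values):
    """Classical SSP strategy: skip the first n/e candidates, then accept the first
    candidate that beats the best of the observation phase (else take the last one)."""
    cutoff = int(len(values) / math.e)                 # length of the observation period
    best_seen = max(values[:cutoff], default=float("-inf"))
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]                                  # forced choice at the end

trials, wins = 10_000, 0
for _ in range(trials):
    vals = random.sample(range(1_000), 100)            # 100 distinct ranks in random order
    wins += secretary_pick(vals) == max(vals)
print(wins / trials)                                   # typically close to 1/e, i.e. about 0.37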
In contrast to the problem of selecting one single best candidate, in this paper we consider the more general problem of selecting the best $k$ candidates in a stream of ranked documents by $k$ choices, which we denote as the k-SSP. An obvious extension of the single SSP is not to consider a single observation period (needed to adjust the optimal selection probability) but to consider $k$ observation periods. Our method, following an approach in [15], first implies the choice of $k$ starting times $t^{\star}_1,\ldots, t^{\star}_k$. After rejecting the first $t^{\star}_1-1$ candidates, the first candidate considered for selection is examined at or after time $t^{\star}_1$.
(1) If $i+j$ candidates have already been examined, with $i$ objects accepted and $j$ rejected, the $(i+j+1)^{st}$ object is chosen if it is better than at least one of the $i$ objects already selected. It is rejected if it is worse than at least one of the $j$ objects already rejected.
(2) If, among all candidates examined so far, the $(i+j+1)^{st}$ is ranked $i+1$ (i.e. between the accepted and the rejected objects), it is chosen if $i+j+1 \geq t^{\star}_{i+1}$ and rejected if $i+j+1 < t^{\star}_{i+1}$, where $i+j+1$ is the current position in the sequence.
(3) If $m$ choices have been made, where $m < k$, and only $k-m$ candidates of the entire input sequence are left to be evaluated, all of the remaining candidates must be chosen in order to guarantee that $k$ objects are selected.
In this paper we do not provide a proof of optimality for this strategy, but in the experiments the algorithm is evaluated with both artificial and real relevance sequences.
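The selection rules (1)-(3) can be stated compactly in code. The sketch below is our reading of these rules (all identifiers are our own) and assumes the starting times $t^{\star}_1,\ldots, t^{\star}_k$ are already given.

def kssp_select(values, k, t_star):
    """k-SSP selection following rules (1)-(3); returns the 0-based positions chosen.

    t_star holds the 1-based starting times t*_1 <= ... <= t*_k.
    """
    n = len(values)
    accepted, rejected = [], []              # ranking values chosen / passed over so far
    chosen_positions = []
    for p, v in enumerate(values, start=1):  # p = i + j + 1, the current position
        if len(accepted) == k:
            break
        if n - p + 1 <= k - len(accepted):          # rule (3): must take all remaining
            accept = True
        elif accepted and v > min(accepted):        # rule (1): beats an already accepted object
            accept = True
        elif rejected and v < max(rejected):        # rule (1): worse than a rejected object
            accept = False
        else:                                       # rule (2): ranked between the two groups
            accept = p >= t_star[len(accepted)]     # compare position with t*_{i+1}
        (accepted if accept else rejected).append(v)
        if accept:
            chosen_positions.append(p - 1)
    return chosen_positions

# Ranking values constructed so that the walkthrough of Figure 2 is reproduced:
# d_3 and d_6 are chosen, everything else is rejected or never considered.
print(kssp_select([5, 3, 8, 7, 2, 9, 1, 4], k=2, t_star=[3, 7]))   # -> [2, 5]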


Figure 2: A strategy for selecting the best two candidates in a stream of documents. Rejected candidates are marked by rectangles, accepted candidates by circles.

An example is shown in Figure 2. We denote the sequence of candidates as $d_{1},d_{2}, \ldots, d_{n}$, appearing at times $t_1,t_2, \ldots, t_n$ respectively. Two candidates are to be selected, with starting points $t_1^{\star}:=t_3$ and $t_2^{\star}:=t_7$. Candidates $d_1$ and $d_2$ are rejected because they fall into the first observation phase. Candidate $d_3$ at $t_1^{\star}$ is accepted because it is better than all of the previously rejected candidates. $d_4$ is better than all the previously rejected candidates but worse than all the previously accepted candidates. It is rejected because it appears before the starting time $t_2^{\star}$; it would have been accepted if $t_4 \geq t_2^{\star}$. $d_6$ is accepted because it is better than at least one previously accepted candidate. Since two candidates have now been chosen, candidates $d_7$ and $d_8$ are not considered.
Based on this selection strategy, the main problem is to find optimal times $t^{\star}_1,\ldots, t^{\star}_k$ that maximize the probability of choosing the best $k$ candidates. Because all orderings of the ranking values are equally likely, intuitively the starting times should be spread evenly over the considered time period. In [15] a strategy to position the starting times is proposed that is proved to be optimal; it is applied in the experiments in Section 4.

3.3 Application of the k-SSP for BCS processing

In the SSP, as in the BCSQL optimization problem, the candidates, i.e. the versions of the data source, appear sequentially one after another. In the BCSQL problem there exists a definite starting point and a definite end point; in the SSP the starting point is determined by the appearance of the first candidate and the end point by the appearance of the last candidate. The trigger condition in BCSQL determines the sequence of candidates considered in the SSP. Each candidate is assigned a ranking value. In the SSP the ranking values are assumed to be distinct. In the BCSQL problem this property depends on the applied ranking function and may not be fulfilled (especially if the data source did not change between two trigger executions). The condition of distinct ranking values may be guaranteed artificially by considering ranking values that depend on time, i.e. versions appearing later in the sequence are assigned a marginally lower ranking value. In the SSP, as in the BCS problem, the selection strategy may be based only on previous observations; no information about future objects is available. In contrast to the general BCS query language in Section 3.1, the delay parameter is not adjustable if the SSP is applied to the optimization of retrieval results: results are returned immediately (delay=0) if they are estimated to have a high ranking.
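The time-dependent tie-breaking mentioned above can be realized, for instance, by subtracting a tiny position-dependent amount from the raw ranking value; this sketch and the constant used are our own assumptions.

def effective_rank(raw_rank, position, epsilon=1e-9):
    """Make ranking values distinct: a version appearing later in the sequence
    (larger position) receives a marginally lower effective ranking value."""
    return raw_rank - position * epsilon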


3.4 A query engine for BCS processing

Figure 3 shows the basic steps of the BCS query processing algorithm. The inputs of the algorithm are the start and end time of the continuous query, the trigger condition, a value $k$ for the number of estimated best items to be chosen and a query profile $Q$.
Based on the start time, the end time and the trigger condition, in steps 2 and 3 the number of reload operations (i.e. the number of 'candidates') and the times $t_1,\ldots, t_N$ of the reload operations are computed. Applying the k-SSP strategy of Section 3.2, the starting times $t^{\star}_1,\ldots, t^{\star}_k$ are computed in step 4, based on the number of candidates and the number $k$ of highly ranked candidates to be chosen. At time $t_1$ the first candidate is loaded in step 7 and its ranking with respect to the search query $Q$ is computed (Section 2.1). In step 9 the ranking is compared to the previous ranking values, which are available in the list $rankList$, and the relative ranking is obtained. In step 10 it is determined whether the new version is chosen as a highly ranked candidate according to Section 3.2. In Figure 3 we assume the availability of a function isSelected(C) that indicates whether a candidate C has been selected. In step 11 the new candidate C is inserted into the list rankList at the position determined by its ranking value. If the candidate is chosen, a message is sent to the client (step 12). Finally, the algorithm waits until the subsequent reload time in step 13 and returns to step 6.

BCS-Query-Processing (Input: start time s, end time e, trigger condition tc, 'number of best choices' k, query Q)
 1   rankList := null
 2   compute the number of candidates $N$ based on s, e, tc
 3   compute the reload times $t_1,\ldots, t_N$ based on s, e, tc
 4   compute the starting times $t^{\star}_1, \ldots, t^{\star}_k$ based on $k$, $N$
 5   wait until $t_1$
 6   for i = 1, ..., $N$
 7       load candidate $C=d_{t_i}$
 8       compute the ranking $R_{C,Q}$ based on $C$, $Q$
 9       compare $R_{C,Q}$ to the previous rankings in rankList
10       select or reject $C$ according to the selection strategy (Section 3.2)
11       insert $(R_{C,Q}, C, isSelected(C))$ into rankList
12       if isSelected(C): send message to client
13       wait($t_{i+1}-t_i$)

Figure 3: The BCS query processing algorithm.
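A compact Python rendering of the listing above is given below. The callables fetch, rank, select_or_reject and notify are assumed to be supplied by the surrounding system, and the evenly spread starting times are only a placeholder for the optimal positions of [15], which are not reproduced here.

import time

def bcs_query_processing(start, end, trigger_seconds, k, query,
                         fetch, rank, select_or_reject, notify):
    """Sketch of the BCS-Query-Processing loop (identifiers are illustrative)."""
    rank_list = []                                             # (ranking, candidate, selected)
    n = int((end - start) // trigger_seconds) + 1              # number of candidates N
    reload_times = [start + i * trigger_seconds for i in range(n)]
    t_star = [max(1, round((i + 1) * n / (k + 1))) for i in range(k)]  # placeholder starting times
    for i, t in enumerate(reload_times, start=1):
        time.sleep(max(0.0, t - time.time()))                  # wait until the next reload time
        candidate = fetch()                                     # load candidate d_{t_i}
        r = rank(candidate, query)                              # compute R_{C,Q} (Eq. 1)
        chosen = select_or_reject(r, rank_list, i, t_star, k)   # k-SSP decision (Sec. 3.2)
        rank_list.append((r, candidate, chosen))
        rank_list.sort(key=lambda entry: entry[0], reverse=True)  # keep rankList ordered by ranking
        if chosen:
            notify(candidate, r)                                # immediate alert (delay = 0)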




4 Experiments

In the following experiments we compare the new BCS query method to the periodic evaluation (PE) method. The considered quality parameters are the freshness of the retrieved information according to Equation (2) and the quality of the search results according to Definition 2. Applying the k-SSP method (Figure 3), objects that are estimated to be relevant are returned to a user immediately after detection on the Web. In this case we assume an immediate decision of the filtering method, so the delay value in formula (2) is 0.
In Definition 2 we defined the quality or relevance of a single document retrieved by a search engine. In order to measure the quality of a set of retrieved documents we sum the quality values of the individual documents. In [20] a very similar relevance measure is presented that is based on graded relevance assessments (in contrast to the binary relevance assessments usually considered in IR). In [20] the functions
\begin{displaymath}
gr:=\sum_{x\in \text{retr}}\text{relevance}(x)\Big/\sum_{x \in D}\text{relevance}(x)
\end{displaymath}

\begin{displaymath}
gp:=\sum_{x\in \text{retr}}\text{relevance}(x)\Big/\vert \text{retr}\vert
\end{displaymath}
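Both measures are straightforward to compute once graded relevance values are available; the following small sketch uses our own function and argument names.

def graded_metrics(retrieved_relevance, all_relevance):
    """Graded recall (gr) and graded precision (gp) over graded relevance values."""
    gr = sum(retrieved_relevance) / sum(all_relevance)
    gp = sum(retrieved_relevance) / len(retrieved_relevance)
    return gr, gp

# e.g. graded_metrics([0.9, 0.7], [0.9, 0.8, 0.7, 0.2]) -> (0.615..., 0.8)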