The World Wide Web consists of an ever-increasing collection of
decentralized web pages that are modified at unspecified times by
their owners. Current search engines try to keep up with the
dynamics of web by crawling it periodically, in the process
building an index that allows better search for pages relevant to a
topic or a set of keywords. Clearly, any good crawling technique
needs to consider the change behavior of web pages. However, the
algorithms used for maintaining a crawled collection, and the
typical frequency of (re-)crawling are insufficient to handle a
class of queries known as Continuous Queries (for example,
see) in which the user expects to be
continuously updated as and when new information of relevance to
his/her query becomes available. For example, consider a user who
wants to monitor a hurricane in progress with the view of knowing
how his/her town will be affected by the hurricane. Obviously, a
system which responds taking into account the continuous updates to
the relevant web pages will serve the users better than another
which, say, treats the query as a discrete query, i.e.,
returns an answer only when the query is submitted.
Not surprisingly, the problem of keeping track of the dynamics of the web becomes inherently different for the continuous query case compared to the discrete query case. We use the term monitoring to explicitly account for the differences from the classical crawling problem. A monitoring task fetches a web page, much like a crawler does, but with the goal of fetching new information relevant to one or more queries while a crawl is not done with any specific user request in mind. The work involved in handling continuous queries is portrayed in Figure 1. For continuous queries, since the system should maintain the freshness or currency of responses to users, the problem translates to one of (a) knowing which pages are relevant, (b) tracking the changes to the pages, to determine the characteristics of changes to these pages, and from these, (c) deciding when to monitor the pages for changes, so that responses are current. The last problem is the focus of this paper, and has several subproblems: allocating the resources needed for monitoring the pages, scheduling the actual monitoring tasks, and then monitoring. Specifically, in this paper, we address the problem of distributing a given number of monitoring tasks among the pages whose changes need to be tracked so as to respond to a set of continuous queries.
In Figure 1, the feedback arcs from the monitoring phase to the earlier phases indicate that observations made during the monitoring phase can be used to adjust subsequent decisions.
It could be argued that discrete queries posed every so often can be considered to be equivalent to continuous queries but the following reasons should help dispel this misconception: First, determining the next time when the discrete query should be posed by the user is highly non-trivial. If the time-interval is kept small then it may induce unnecessary load on the system, particularly when the updates are not frequent. If we set the time-interval to be large, it may lead to loss of information if updates are more frequent than expected. Second, in contrast to discrete queries which have zero lifetimes, continuous queries have a non-zero lifetime, enabling a query system to study a query's characteristics carefully and thereby answer it more efficiently. Unlike in the case of discrete queries, the time taken to provide the system's first response to a continuous query may not be as important as the maintenance of currency during all the responses.
The above discussion makes it clear that not only the nature of the crawling problem but optimization goals also become different when we move from discrete to continuous queries. To this end, our optimization metric minimizes the information loss, compared to an ideal monitoring algorithm which can monitor every change of a page.
Our contributionsIn this paper, we introduce (Continuous Adaptive Monitoring) (CAM), a technique to monitor changes. The goal of CAM's resource allocation algorithm is to allocate limited monitoring resources across pages so as to minimize the information loss compared to an ideal monitoring algorithm which can monitor every change of a page. This goal also differentiates crawling from monitoring. Whereas most of the earlier crawling strategies assume a Poisson update process, CAM makes no such assumptions; it tracks the changes to pages dynamically and evolves the statistics relating to these changes, making CAM more practical and robust.
We show the optimality of CAM's resource allocation algorithm under specific change scenarios and formally prove that, in the continuous query case, that proportional allocation, in which the pages with high frequency of change are allocated more monitoring tasks, works better than uniform allocation, which allocates an equal number of monitoring tasks to each page, independent of its change frequency. On the surface, this seems to contradict a prior result  that the uniform allocation of crawling resources produces better results than its proportional counterpart. We provide an intuitive explanation for this apparently surprising behavior. This also emphasizes why CAM is distinct from earlier work on crawl maintenance for answering discrete queries.