Retrieving and Organizing Web Pages by "Information Unit"
C&C Research Laboratories, NEC USA, Inc.
110 Rio Robles, M/S SJ100, San Jose, CA 95134, USA
Email:{wen,candan,qvu,agrawal}@ccrl.sj.nec.com
Tel:(408)9433008 Fax:(408)9433099
Copyright is held by the author/owner(s).
WWW10, May 15, 2001, Hong Kong.
ACM 1581133480/01/0005.
Abstract:
Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.Keyword: Web proximity search, link structures, query relaxation, progressive processing
Introduction
To find the announcements for conferences whose topics of interests include
WWW, a user may issue a query ``retrieve Web documents which contain keywords
WWW,
conference,
and topics.'' We issued the above query to an Internet search engine
and surprisingly found that the returned results did not include many relevant
conferences. The main reason for this type of false drops is that the contents
of HTML documents are often distributed among multiple physical pages and
are connected through links or frames. With the current indexing and search
technology, many search engines retrieve only those physical pages
that have all the query keywords. It is crucial that the structure of Web
documents (which may consist of multiple pages) is taken into consideration
for information retrieval.
In this paper, we introduce the concept of an information unit, which can be viewed as a logical Web document, which may consist of multiple physical pages as one atomic retrieval unit. The concept of information units does not attempt to produce additional results that are likely to be more than users can browse. Instead, our approach is to keep those exactly matched pages at the top of the ranked list, while merging a group of partially matched pages into one unit in a more organized manner for easy visualization and access for the users. In other words, unlike the traditional keywordbased query relaxation, the information unit concept enables Webstructure based query relaxation in conjunction with the keywordbased relaxation.
Let us denote the set of Web pages containing a given keyword, , as . Furthermore, for a given , let us define as the set of Web pages that contain a link to at least one of the pages in . Similarly, let us denote as a set of pages that are linked from at least one of the pages in . In the example in Figure 1, , , and represent , , and respectively.
Figure 2 illustrates the query results that are generated progressively for a query with two keywords. In this figure, solid lines indicate reuse and dotted lines indicate derive. We denote class 0 information units as documents which contain both and ; class i information units are a pair of documents such that one contains and the other and there is a path of length between them. Note that (1) the intermediate query results of Class 0, and , are reused (indicated by solid arrows) while processing the query for Class 1; (2) and are derived using Class 0 results (indicated by dashed arrows); and (3) if necessary, computation of () and () can be parallelized.
 Users are not just interested in a single result, but the top results.
 While generating result, we want to reuse existing results.
 And, since the Web is large and usually a simple search results in thousands of hits, preprocessing and any computation which requires touching or enumerating all pages is not feasible.
The rest of the paper is organized as follows. In Section 2, we provide formal definitions of the more general problem of information unitbased retrieval. In Section 3, we describe the query processing techniques and generalize the framework to more complex scenario. In Section 4 we describe how our technique can be extended to deal with fuzziness in keywordbased retrieval, such as partial match and different importance of query keywords. In Section 5 we present experimental results for evaluating the proposed algorithm on actual Web data. In Section 6, we review related work and compare it with our work. Finally, we present concluding remarks in Section 7.
Data and Query Models for Retrieval by Information Unit
In this section, we describe the query model for information unitbased
retrieval and start with the definitions for the terms used in the rest
of the paper:
 The Web is modeled as a directed graph , where is the set of physical pages, is the hyper or semanticlinks connecting these pages. We also define the undirected Web as , where is the same as except that the edges in are undirected.
 The dictionary, , is a finite set of keywords that can be used for querying the Web. A query, , which is a list of keywords, is a subset of ; i.e., .
 There is a pagetokeyword mapping , which lists the keywords in a given page. There is also a keywordtopage mapping , which lists the set of Web pages that contain the keyword.
 Finally, we also assume that there is a cost function, (or ), which models the distance of the pages from each other.
Given the Web, , its undirected version, , and a query, , an answer to is a set of pages, , that covers all the keywords in the query; i.e.,
A minimal answer to , then, is a set of pages, such that no proper subset, , of is an answer to . The cost of a minimal answer, , to is then the sum of the edge costs of the tree with minimal cost in that contains all the vertices in .
The solid lines in this figure denote the links that are going to be used for computing the cost of the information units (these are the lines on the smallest tree) and the dashed lines denote the links that are being ignored since their cost is not the minimum. For instance, there are at least two ways to connect all three vertices in cluster A. One of these ways is shown with dark lines, and the sum of the corresponding edge weights is 12. Another possible way to connect all three vertices would be to use the dashed edges with weights 7 and 8. Note that if we were to use the second option, the total edge weights would be 15; i.e., larger than 12 that we can achieve using the first option. Consequently, the cost of the best minimal answer is 12, not 15. In this example, , , and .
The above query formulation, when instantiated with two keywords, can be easily solved by using the allpairs shortest path solution (however, this algorithm would require a complete knowledge of the graph). In the next section, we examine algorithms for query processing that find the results by discovering the graph (or the Web) incrementally.
Query Processing
Conventionally, answers to a Web query is an ordered list of pages, where
the order reflects the rank of a page with respect to the given query.
Consequently, we expect that the answers to an information unit query be
an ordered set of logical Web documents; i.e., an ordered list of
sets of Web pages. The ranks of the information units are computed by aggregating
the cost of the edges involved in the graph representing the query result
connected via linkin and linkout pages as described in the previous section.
There are two ways to generate such an ordered list:
 generate all possible results and, then, sort them based on their individual ranks, or
 generate the results in the decreasing order of their individual ranks.
In this section, we develop a progressive query processing algorithm for retrieving information units on the web. Since the search space is very large, the progressive algorithm relies on local information. The implication of relying on local information to produce answers is that the ranking of query results is approximate.
Abstract Formulation of the Problem
The problem of finding the minimum weighted connected subgraph, , of a given graph , such that includes all vertices in a given subset of is known as the Steiner tree problem [2]. If it exists, is guaranteed to be a tree. An extension of this problem, where we are given a set of sets of vertices such that the subgraph has to contain at least one vertex from each group is known as the group Steiner tree problem. The problem of finding the best answer with the minimal cost information unit to a query, , can be translated into the problem of finding minimumweighted group Steiner tree problem as follows: Let us be given an undirected Web,, and a query . Let also , , be the set of vertices, , such that . Let us assume that the corresponding minimum weight group Steiner tree consists of a set of vertices, and a set of edges . Then, the best answer with the minimal cost,, with respect to , is the maximal subset of such that, for all , . Both minimum weight Steiner tree [2] and minimum weight group Steiner tree problems [3] are known to be NPhard.As a result of this NPcompleteness result, the minimum weight group Steiner tree problem (and consequently, the minimum cost information unit problem) is not likely to have a polynomial time solution, except in certain special cases, such as when vertex degrees are bounded by [4] or the number of groups is less than or equal to, i.e, or . However, there are a multitude of polynomial time approximation algorithms that can produce solutions with bounded errors. The most recent of these solutions, to our knowledge, is presented by Garg et al. [5]. This particular algorithm provides a randomizedapproximation, where is the number of vertices in the graph and is the number of groups. An earlier result, by Bateman et al. [6] had a performance guarantee, which is independent of yet nonlogarithmic in . However, since in the domain of Web querying, is guaranteed to be much smaller than , this earlier result is more applicable.
Note, however, that none of the above algorithms satisfy our essential requirements:
 We are not only interested in the minimumweight group Steiner tree. Instead, what we need is minimumweight group Steiner tree, where is the rank of the corresponding answer in the result.
 We would prefer to generate minimumweight group Steiner tree, after we generate minimumweight group Steiner tree, reusing results of the some of the earlier computation.
 Since the Web is large, we can not perform any preprocessing or any computation which would require us to touch or enumerate all pages on the Web.
Algorithm for Information Unit Retrieval
In this section, we develop a heuristic query processing algorithm, to
retrieve information units, that adheres to the stated requirements. Unlike
the other minimum spanning tree based algorithms, we do not generate the
minimum spanning tree for the entire graph (or the entire Web). Furthermore,
unlike other shortestpath based algorithms, we refrain ourselves from
generating all possible shortest paths. Note that this does not mean that
we do not know of the structure of the Web. In fact, in our implementation,
we used the proposed algorithm in conjunction with a Web search index we
maintain at NEC CCRL. The search index contains information regarding the
Web pages, as well as the incoming and outgoing links to them.
The general idea of the algorithm is follows. Given a query , we identify the corresponding set of pages such that appears in each page in . The algorithm starts with a graph represented by the vertices in and then by exploring links (incoming or outgoing) with the minimum cost. During this exploration process, if we find a subgraph that satisfies the query, we output that as a query result. The rank of the query result is estimated by constructing a minimum cost spanning tree over the subgraph such that all the solution vertices are connected. Figure 4 depicts the essential parts of this algorithm.
Lines 57 of Figure 4 are the initialization steps for the control variables. The main loop is depicted in lines 835 of the algorithm. The algorithm effectively grows a forest of MSTs. During each iteration of the loop, we choose a function, (Figure 5), to choose among different ways to grow the forest. Note that depending on the goal, we can use different choice strategies. In this paper, we describe two strategies: minimum edgebased strategy and balanced MST strategy. These two strategies will result in different degree of error and complexities. In Sections 3.3.1 and 3.3.2, we evaluate the two strategies and discuss their quality and cost tradeoffs.
Note that, in this algorithm, we assume that the costs of all neighboring vertices to be given, but in the real implementation this cost can be computed onthefly. The cost may be based on a variety of factors. For example, the neighboring vertex is in the same domain versus outside the domain or relevancy of links based on anchor text and/or URL strings. Much research on this issue has been done in the scope of efficient crawling [7].
After choosing an MST and an edge for growth, the algorithm checks if (line 10) the inclusion of this edge causes two MSTs to merge. If this is indeed the case, the algorithm next checks if the resulting subgraph can satisfy the query (lines 1722). Essentially, this check determines if the new MST has a set of vertices (pages) such that the pages collectively include each keyword in the query . This step ensures that we only consider newly produced query results. The new query results are ranked by using the minimum spanning tree of the connected subgraph. Finally, the newly obtained (using subroutines EnumerateFirst and EnumerateNext) query results are output (lines 2531); the algorithm terminates if results are produced, otherwise the algorithm continues by adding the next minimum cost edge incident on the current set of vertices. In lines 33 and 34 the algorithm inserts the new MST into the growth candidates and prepares for a new iteration of the search.
Note that the minimum spanning tree computation (lines 11 and 14) of this algorithm is incremental. Since the algorithm visits edges in the increasing order of the costs, the tree is constructed incrementally by adding neighboring edges while growing a forest (multiple trees); consequently, it overcomes the weaknesses of both Kruskal's and Prim's algorithm when applied to the group Steiner tree generation on the Web. In particular, the algorithm proposed here does not require complete knowledge of the Web and it does not get stuck at one nonpromising seed by growing a single tree. Next, we use an example to further illustrate the details of the algorithm. In this example, we will assume that the chooseGrowthTarget subroutine simply chooses the least cost edge for growth.
 Figure 6(b) shows the first step in the algorithm. An edge with the smallest cost has been identified and the endpoints of the edge is inserted into . Note that, in the figure, all vertices in are surrounded by a square and all edges included are shown as darker lines.
 Figure 6(c) shows the step in which the algorithm identifies the first information unit. At this stage, the algorithm first identifies a new possible solution, and it verifies that the solution is within a single connected subgraph as a whole. Next, the algorithm identifies the corresponding Steiner tree through a sequence of minimum spanning tree computations and cleanups (these steps are not shown in the figure). The resulting Steiner tree is shown with thicker lines. Note that the dark nodes denote the physical pages which form the information unit. Hence, at the end of this step, the algorithm outputs the solution, which has the total cost , as the first result.
 Finally, Figure 6(d) shows the algorithm finding the second and last Steiner tree. In this step, the newly added edge reduces the number of individual connected components in the graph to one. Consequently, the algorithm identifies a new solution. The algorithm again identifies the corresponding Steiner tree through a sequence of minimum spanning tree computations and cleanups; these steps are shown in the Figure 7:
 Figure 7(a) shows the state of the graph at the beginning of this step. Note that the graph denoted with darker lines is connected and subsumes all vertices, denoted darker, in .
 Figure 7(b) shows the minimum spanning tree,.
 Figure 7(c) shows the Steiner tree, , obtained by removing the unnecessary vertices from.
Evaluation of the Algorithm
Since the proposed algorithm is based on local analysis and incomplete information, it is not surprising that it has limitations when compared to the optimal solution. Note that the algorithm is polynomial whereas, as we discussed earlier, finding the optimal solution is NPhard. We now point to the cases in which the solution of our algorithm differs from the optimal solution through a series of examples. In this section, in order to show the tradeoff between quality and cost of the heuristic, we will use two different functions which will result in different degrees of error and complexities. These two functions are shown in Figure 8. Intuitively, the first function chooses the smallest weighted edge at each iteration, whereas the second one aims at growing the MSTs at a balanced fashion. In Section 5, we will provide experimental evaluations of the algorithm and investigate the degree of divergence of both heuristics from the optimality.
Evaluation of the Minimum Edgebased Strategy
Let us consider the graph shown in Figure 9(a).
As per the chooseGrowthTarget
subroutine, the edges that will be chosen will be , ,
and .
Since the weight of all three edges is the same, the order in which the
edges are included is nondeterministic and does not impact the final outcome.
The state of the graph will be as shown in Figure 9(b).
At this state, the algorithm will generate the one and only one information
unit for the three keywords; the cost or rank associated with this information
unit is the sum of the weights of the three darkened edges which is 12.
However, there exists a Steiner tree of lower cost (actually there are
three possible Steiner trees: ,
and
each with cost 10), which would have sufficed to form an information unit.
This example illustrates that the proposed algorithm may not always generate
optimal solutions.
where is the number of vertices in the group Steiner tree and is the cost of the most expensive edge in the returned group Steiner tree. In general, when there are groups (note the similarity between Figure11(c) and 9(b)), the overestimation ratio becomes
where is the number of vertices in the group Steiner tree,s are the costs of the edges on an optimal group Steiner tree connecting vertices, is the smallest cost of an edge and is the largest cost of an edge in the returned group Steiner tree.
Consequently, every solution, , returned by the heuristic has a corresponding range, , where is the cost returned by the algorithm and . Note that, consequently, the algorithm does not guarantee a progressive order of the results as discussed in earlier examples. In Section 5, we provide experimental evaluation of the algorithm to see how much it diverges from the optimal results. The results show that the divergence in reality is much less compared to the worst case analysis provided above (i.e. it is almost a constant factor) for information unit retrieval purposes.
Complexity. Let us assume that the maximum number of edges incident on a vertex is bounded by a constant (this is a reasonable assumption on the Web). Let us also assume that each of the first results is embedded in a subgraph of diameter . In the worst case, the number of vertices that will be touched by the edgebased growth strategy ( es) is (on the Web, the size of a subgraph generally does not increase exponentially with its diameter; this is a worst case figure. Note also that for small diameters, , where is the set of all pages on the Web.)
The worst case execution time of the proposed algorithm, then, is :
 Using a heap, maintaining the list of incident edges (maximum number of edges is ) takes time.
 Since the spanning tree is incrementally constructed, it takes time to construct the trees.
 For solutions it takes time to delete leaves of the spanning trees to get the approximate Steiner trees.
 It takes time to sort and return new results identified by the introduction of a new edge.
 Splitting the Web into domains, and limiting the search into only intradomain Web documents. Since, the input graph is divided into multiple, relatively small, subgraphs, this significantly reduces the search space. As we pointed out in Section 1, the average size of a domain is around 100 documents.
 Assigning an upper bound on the total cost or on the number of vertices of an acceptable Steiner tree. For example, we may restrict the search of related pages to form an information unit within a radius of links.
 Limiting the fanout of every vertex in the graph to a predetermined small constant. This causes the graph to be sparse thereby reducing the search time. Of course, it is a major challenge to identify which of the outgoing or incoming edges should be kept.
Evaluation of the Balanced MST based Strategy
In the previous section, we have seen that the overestimation ratio is
proportional to the number of vertices in the MST. Consequently, in order
to minimize the overestimation ratio, it is important to prevent the formation
of long MSTs. Furthermore, in order to minimize the absolute value of the
overestimation, we need to prevent very large MSTs to form. The next strategy
that we describe improves on the previous one on these aspects. Later in
Section 4, we show that the balanced
MST based strategy is essential to the extension for dealing with fuzziness
in keywordbased retrieval.
The subroutine chooseGrowthTarget performs a look ahead before choosing the next MST. It identifies all possible ways to grow the forest and it chooses the growth option where the operation will result in the smallest growth. Consequently, the minimum spanning trees are created in the increasing order of their cost, however since MST to Steiner tree conversion is suboptimal, results may still be suboptimal.
Let us reconsider the graph shown in Figure 9(a). As per the chooseGrowthTarget subroutine, the edges that will be chosen will again be , , and (only these edges will result in an MST). However, we earlier have seen that this selection is suboptimal. Consequently, since it is based on MST, this second strategy does not necessarily result in an optimal solution and the overestimation ratio we found for the first strategy still holds.
This strategy prevents the formation of long chains and instead favors the creation of balanced size minimum spanning tree clusters. Note that since the overestimation ratio is proportional to the number of vertices in the MST, this results in a reduction in the amount of overestimation. This reduction is especially large when the edge weights in the graph are mostly the same, as when two edges have the same weight, the first strategy does not dictate a preference among them.
Complexity. The complexity of the second strategy is similar to the complexity of the first strategy: Let be equal to . As earlier, let us assume that the maximum number of edges incident on a vertex is bounded by a constant . Let us also assume that each of the first results is embedded in a subgraph of diameter . In the worst case, the number of vertices that will be touched by the balanced MST strategy ( bs) is . Then the complexity of the algorithm is :
 Using a heap structure, maintaining the list of the MSTs (maximum number of MSTs is ) takes time.
 Since the spanning tree is incrementally constructed, it takes time to construct the trees.
 For solutions it takes time to delete leaves of the spanning trees to get the approximate Steiner trees.
 It takes time to sort and return new results identified by the introduction of a new edge.
 is definitely smaller than . Consequently, it is much cheaper to maintain the order of MSTs than to maintain the order of incident edges.
 Sorting and returning newly found results takes the same amount of time in both strategies.
 The amount of time spent for creating MSTs are the same in both algorithms: .
 The time spent for trimming MSTs are the same in both algorithms: .
Note, however, among the four components that we identified above, the time spent for creating the MSTs is the most important one, as it describes the number of webpages that may need to be visited and processed (unless the information is readily available as an index). For this component, both strategies have the same worstcase complexity . On the other hand, when the edge weights are distinct edgebased strategy can cover larger distances () without visiting all the vertices in the same diameter, leading into a lower number of page visits. When the edge weights are mostly the same, on the contrary, the balanced MSTbased strategy is likely to eliminate the formation of unnecessarily long branches of the MSTs, leading into a lower number of page visits.
Dealing with Partial Matches and Fuzziness in Retrieval
In this section, we describe how to extend the algorithm to deal with fuzzy
and partial keywordbased retrieval problems. More specifically, we address
the following issues:
 Disjunctive queries: In this paper, we have formulated the Web query as a set of keywords, denoting a conjunctive query asking for all keywords. One extension to our system is to handle disjunctive queries. A disjunctive query is a query where any one of the keywords is enough to satisfy the query criterion. Combinations of conjunctions and disjunctions can also be handled through queries in conjunctive normal or disjunctive normal forms. The actual process for handling combinations of conjunctions and disjunctions is not discussed further in this paper.
 Partial matches (or missing keywords): In some cases, a user may issue a conjunctive query and may be willing to accept query results which do not have all the query terms. Clearly, for such a query, the user will prefer results which contains more keywords to the results which contain less keywords. One solution to such an extension is to translate a conjunction query to a disjunctive query . This formulation, however, would not penalize results with missing keywords. Hence, the query processing algorithm needs to be modified so that the results with partial matches are ranked lower than the results with all the query terms.
 Similaritybased keyword matching and keyword: Because of the mismatch between authors' vocabularies and users' query terms, in some cases, users may be interested in not only results which exactly match with the query terms, but also results which contains keywords related to the query terms. For example, a logical document which contains keywords, ``Web'' and ``symposium'' may be an acceptable result to a query by keyword semantical relaxation. In this case, we assume that there is a similarity function, , where is the dictionary, that describes the similarity of keywords. Such similarity functions have been studied elsewhere [8].
 Keyword importance: In the problem formulated in this paper, we assumed that each keyword in a given query, , is of the same importance. However, in some cases, we may want to give preference to some of the keywords. For example, for a query with the keywords , the keyword Y2K may be more important than the two other terms. We can also assume that there is an importance function, , where is the dictionary, that describes the importance of the keywords for a given query.
We now describe the proposed solutions.
Partial Matches (Missing Keywords)
In order to consider query results which contain only a partial set of query terms, we need to adapt our technique to include group Steiner trees that do not fully cover all keywords. Instead of modifying the algorithm, we apply a graph transformation, , to the original graph representation of the Web, , so that the algorithm in Section 3.2 can be used to handle partial matches. The transformation assumes that the second growth strategy is used by the algorithm.Description of the transformation: Let us assume that the set of nodes that contain keyword, , is denoted as . The transformation attaches to each node, , a set of pseudo nodes . It updates the keyword/page mappings, and , such that the only keyword that new pseudo node contains is . The transformation, then, updates the distance function, , such that the distance between and each new vertex, , is .
Fuzzy Keyword Matching
Both fuzzy keyword matching and keyword importance require preferencebased processing: In the case of fuzzy keyword matching, we are given a set of keywords and we want to accept the result even if some of the keywords are not exactly matching, but are similar. Of course, we are trying to maximize the overall similarity. In the case of importancebased retrieval, on the other hand, each keyword has a different importance and we are trying to maximize the overall importance of the keywords. To handle this, we again use a graph transformation, , to handle similarity and importancebased processing. The details of the transformation is beyond the scope of this paper.
Experimental Evaluation
As discussed in Section 3, the proposed
heuristic does not always generate information units with their optimal
cost and it can also provide outoforder solutions. However, if we consider
the fact that the underlying problem of finding an information unit in
the Web is NPhard and that the graph has to be constructed incrementally,
this is not unexpected. In this section, we use empirical analysis to evaluate
the overall quality of the proposed algorithm.
Evaluation Criterion
One of the first experiments we conduct is on a set of synthetic graphs that are generated randomly. The parameters of these synthetic graphs are listed in Table 1. In order to have a yardstick to compare our results, we first perform an exhaustive search to find all information units along with their optimal costs. Next, we run our algorithm incrementally. We visualize the results of this experiment in three ways.In the first measure, we compute the average cost of information units under both schemes (exhaustive and heuristic) as a function of top results where is varied from 10 to 100 in the increments of 10. This plot indicates the deviation or dispersion error of the heuristic. Note that since the heuristic generates suboptimal results, and once a solution is generated it is never revised, we are bound to have some error even after exploring the entire graph.
The second plot in this experiment shows the percentage of nodes and edges used in generating top results. This plot allows us to measure the efficiency of progressive query processing.
The third plot shows the recall ratio, which captures the performance of the top results produced by the proposed algorithm. The recall ratio is computed as a percentage of information units in the answer set generated from the heuristic when compared to that from the exhaustive search. For example, if the query top returns the , , , , and information units (as ranked by the exhaustive search), the recall ratio is 80% since the heuristic missed the information unit. This is similar to the recall measure used by information retrieval. Unfortunately, it penalizes the results by not giving any credits for the result, which can also be of some value.
Thus, we also visualize another parameter called adjusted recall ratio. The adjusted recall better reflects the utility of the results for information unit retrieval. The adjusted recall is calculated as follows: Obviously, since the query results is supposed to be a sorted list of information units, the information retrieval system should be penalized by providing the information unit instead of the information unit. Therefore, we give partial credit, , for such recall instead of giving full credit of . Second, we should give the information unit partial credit. In this example, we give a score of. Note that this formulation has a nice property: If the information unit was returned as the last result, the system would give a score of . This is appropriate since the information unit is of more value than the information unit. Similarly, if the information unit was returned, the system will give a score of . This is also appropriate since the information unit is of more value than the information unit. The adjusted recall ratio for the top results is calculated as follows:
where is the actual rank of the result.
Let the user ask for the top solutions and let the rank of the results returned by the algorithm be .
 Recall ratio: The algorithm returns out of the required solutions. Hence the recall ratio is .
 Adjusted recall ratio: The adjusted recall ratio is:

In the second experiment, instead of using synthetic data, we conducted experiments with real Web data. We downloaded the pages from wwwdb.stanford.edu/people/. The graph corresponding to the Web data consists of 236 nodes and 414 edges. Presentation slides and technical reports in postscript format were excluded. The weight associated with each edge was set to 1. On this data, we ran a threekeyword query involving keywords: . The reason for this choice was that the above three keywords had a reasonable number of occurrences in the graph: 14, 7, 7, respectively. In the next subsection, we summarize our findings with regard to these experiments.
Experimental Results on Synthetic Graphs
The experiment results show that, in a neighborhood with 100 nodes, it takes on the order of 300ms (only 1020ms of which is the system time) to generate the top 10 retrieval results. When the user request is increased to top25 retrieval, it takes on the order of 1700ms (only 2030ms of which is the system time) to generate the results. Note that, since the process is progressive, top25 generation time can be hidden while user is reviewing the top10 results list.Figures 14, 15, and 16, depict the result of performing three keyword queries over the 100 node, 500 node, and 1000 node synthetic graphs, respectively. Figures 14(a), 15(a), and 16(a) report the average cost of the information unit in the answer set of size from 10 to 100 in the increments of 10. As expected, due to the suboptimal nature of the proposed heuristic, the average cost of information units is inferior to the one produced by the exhaustive search. In the case of 100 node graph the cost inflation is within two times the optimal cost and in the case of 500 nodes it is approximately times the optimal cost. For 1000 nodes graph the cost inflation is also around three times the optimal cost. We expect if the variability of the weights of the edges is reduced, the inflation in the cost will become narrower. In fact, we observe this effect on the experiments that we run on real data (Section 5.3).
Finally, Figures 14(c), 15(c), and 16(c) report the recall ratio and the adjusted recall ratio of the proposed heuristic. As discussed in Section 3, the proposed heuristic generates result not necessarily in the ranked order. Furthermore, the ranking of results itself is not always as specified by the optimal order. Due to the combination of these two limitations of the proposed heuristic, we achieve less than perfect recall. As observed, the recall ratio for 100 nodes is in the range of 10% (when the size of the answer set is very small) to 50%. For 500 nodes the range is 20% to 55% and for 1000 nodes the range is 0% (for top results) to 35%. If we exclude the top datapoint, the lower end of the recall ratio for the 1000 node graph becomes about 15%. As we have argued, the traditional recall measure imposes severe penalties for the result misses. Under this measure, for example, if in the top answer set the element is replaced by the element, the loss of accuracy is 10%. Similarly, if the answer set includes to elements instead of the first 10, the loss of accuracy is 100%. The adjusted recall ratio defined above, on the other hand, is more realistic in the sense it penalizes the misses but does give partial credit for retrieving lower ranked results. In our experiments we found that the range for adjusted recall for 100 nodes is 20% to 60%; for 500 nodes it is 20% to 70%, and for 1000 nodes it is 0% (for the top datapoint) to 50%.
Figures 14(a), 15(a), and 16(a), report the average cost of the information unit in the answer set of size from 10 to 100 in the increments of 10. As we explained earlier, due to the suboptimal nature of the proposed heuristic, the average cost of information units is inferior to the one produced by the exhaustive search. Another reason is that our algorithm only explores on an average less than 10% of edges. To identify how well our solutions are compared with the optimal solution if they are derived based on the same graph, we conducted additional experiments to find the optimal solutions for the same subgraphs our algorithm has explored rather than the whole graphs. We then compare the average cost of our suboptimal solutions with the optimal solutions in Figure 17. The experimental results show that the average costs of our solutions are closer to the cost of the optimal solutions compared with the experimental results shown in Figures 14(a), 15(a), and 16(a), especially for larger graphs with 500 and 1000 nodes.
Evaluation on Real Web Data Set
Figure 18 reports the results of our
experiments with actual Web data. Figure 18(a)
illustrates the average costs under the two schemes. Here we see that the
average cost of the information units is within 30% of that computed by
the exhaustive algorithm. The reason for the small error in the cost inflation
is because the edges in the Stanford Web data have a unit cost. As shown
in Figure 18(b), a larger percentage
of the edges are visited because the low connectivity of the Web pages
in the chosen dataset. Finally, Figure 18(c)
reports the recall ratio which is in the range of 30% to 60%. The decline
in recall between 50 and 100 results (xaxis) can be explained as follows:
Note that from Figure 18(b) we can
observe that the visited portion of the graph does not change much indicating
that the graph is large enough to compute the answers in this range. Due
to the greedy approach of the heuristic, when a given connected component
has multiple answers, these answers are produced in a random order and
not necessarily in the order of their costs. This contributes to the drop
in recall. However, the adjusted recall ratio reaches almost 70% and the
curve remains flat, which validates the performance of our heuristics algorithm
since it provides the same level of recall utility to the users. More interestingly,
our algorithm, shown in Figure 18(c),
can provide 70% of adjusted recall ratio by exploring only about 30% of
nodes and 25% of edges.
In summary, our experiments on synthetic data are validated with actual data and are promising. In particular, the proposed heuristic generates information units of acceptable quality by exploring a very small part of the graph. By comparing the experimental results on the real Web data and on the synthetic graphs, we found that our heuristic algorithm performs much better on the real Web data in all categories. We examined the Web site connectivity of the Stanford site and found that the link fanout is around 4 on average. We exclude the search on presentation slides and technical reports in PDF or Postscript format. Another reason for such low fanout is that the personal home pages usually have low depth and the leaf nodes reduce the average fanout. We also found the link structures of real Web sites are more like "trees", rather than highly connected "graphs" used in experiments on synthetic data. We observe that the algorithm performs better in searching a treelike structure with lower connectivity. We are pleased to see the heuristic algorithm performs better on the real Web data.
Related Work
We have discussed existing work on group Steiner trees in Section
3.
In this section, we give an overview of work in the area of integrating
content search on the Web and Web structure analysis.
Although search engines are one of the most popular methods to retrieve information of interest from the Web, they usually return thousands of URLs that match the user specified query terms. Many prototype systems are built to perform clustering or ranking based on link structures [9,10] or links and context [11,12]. Tajima et al. [13] presented a technique which uses cuts (results of Wen structure analysis) as querying units for WWW. [14] first present the concept of information unit and [15] extends to rank query results involved multiple keywords by (1) finding minimal subgraphs of links and pages including all keywords; and (2) computing the score of each subgraph based on locality of the keywords within it.
Another solution to the above problem is the topic distillation [16,17,11,9] approach. This approach aims at selecting small subsets of the authoritative and hub pages from the much larger set of domain pages. An authoritative page is a page with many inward links and a hub page is a page with many outward links. Authoritative pages and hub pages are mutually reinforcing: a good authoritative page is linked by good hub pages and vice versa. In [18], Bharat et al. present improvements on the basic topic distillation algorithm [16]. They introduce additional heuristics, such as considering only those pages which are in different domains and using page similarity for mutual authority/hub reinforcement.
Similar techniques to improve the effectiveness of search results are also investigated for database systems. In [19], Goldman et al. propose techniques to perform proximity searches over databases. In this work, proximity is defined as the shortest path between vertices (objects) in a given graph (database). In order to increase the efficiency of the algorithm, the authors also propose techniques to construct indexes that help in finding shortest distances between vertices. In our work, in the special case of two keywords we also use shortest distances. In the more general case, however, we use minimal group Steiner trees to gather results. Note that minimal group Steiner trees reduce to the shortest paths when the number of groups, that is, keywords, is limited to two. DataSpot, described in [20], aims at providing ranked results in a database which uses a schemaless semistructured graph called a Web View for data representation.
Compared with existing work, our work aims at providing more efficient graph search capability. Our work focuses on progressive query processing without the assumption of that the complete graph is known. Our framework considers queries with more than 2 keywords, which is significantly more complex. In addition, we also present how to deal with partial matches and fuzziness in retrieval.
Concluding Remarks
In this paper, we introduced the concept of information unit, which
is as a logical document consisting of multiple physical pages. We proposed
a novel framework for document retrieval by information units. In addition
to the results generated by existing search engines, our approach further
benefits from the link structure to retrieve results consisting of multiple
relevant pages associated by linkage and keyword semantics. We proposed
appropriate data and query models and algorithms that efficiently solve
the retrieval by information unit problem. The proposed algorithms satisfy
the essential requirement of progressive query processing, which ensures
that the system does not enumerate an unnecessarily large set of results
when users are interested only in top matches. We presented a set of experiment
results conducted on synthetic as well as real data. These experiments
show that although the algorithm we propose is suboptimal (the optimal
version of the problem is NPhard), it is efficient and provides adequate
accuracy.
Acknowledgement
This work by Quoc Wu was performed when the author was with NEC USA Inc.Bibliography
 1
 Bruce Croft, R. Cook and D. Wilder. Providing Government Information on the Internet: Experiences with THOMAS.
 2
 S.L. Hakimi.
 3
 G. Reich and P. Widmayer.
 4
 E. Ihler.
 5
 N. Garg, G. Konjevod, and R. Ravi.
 6
 C.D. Bateman, C.S. Helvig, G. Robins, and A. Zelikovsky.
 7
 J. Cho, H. GarciaMolina, and L. Page.
 8
 R. Richardson, Alan Smeaton, and John Murphy.
 9
 David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan.
 10
 Lawrence Page and Sergey Brin.
 11
 David Gibson, Jon Kleinberg, and Prabhakar Raghavan.
 12
 Sougata Mukherjea and Yoshinori Hara.
 13
 Keishi Tajima, Yoshiaki Mizuuchi, Masatsugu Kitagawa, and Katsumi Tanaka.
 14
 WenSyan Li and YiLeh Wu.
 15
 Keishi Tajima and Kenji Hatano and Takeshi Matsukura and Ryoichi Sano and Katsumi Tanaka.
 16
 Jon Kleinberg.
 17
 Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg.
 18
 Krishna Bharat and Monika Henzinger.
 19
 Roy Goldman, Narayanan Shivakumar, Suresh Venkatasubramanian, and Hector GarciaMolina.
 20
 Shaul Dar, Gadi Entin, Shai Geva, and Eran Palmon.
In Proceedings of Digital Libraries (DL'95), 1995.
Steiner's Problem in Graphs and its Implications.
Networks, 1:113131, 1971.
Approximate Minimum Spanning Trees for Vertex Classes.
In Technical Report, Inst. fur Informatik, Freiburg Univ., 1991.
Bounds on the Quality of Approximate Solutions to the Group Steiner Tree Problem.
In Proceedings of the 16 Int. Workshop on Graph Theoretic Concepts in Computer Science. Lecture Notes in Computer Science, pages 109118, 1991.
A Polylogarithmic Approximation Algorithm for the Group Steiner Tree Problem.
In Proceedings of the 9 Annual ACMSIAM Symposium on Discrete Algorithms, pages 253259, 1998.
Provably Good Routing Tree Construction with Multiport Terminals.
In Proceedings of the ACM/SIGDA International Symposium on Physical Design, pages 96102, Napa Valley, CA, April 1997.
Efficient Crawling through URL ordering.
Computer Networks and ISDN Systems. Special Issue on the Seventh International WorldWide Web Conference, Brisbane, Australia, 30(17):161172, April 1998.
Using Wordnet as a Knowledge base for Measuring Conceptual Similarity between Words.
In Proceedings of Artificial Intelligence and Cognitive Science Conference, Trinity College, Dublin, 1994.
Inferring Web Communities from Link Topology.
In Proceedings of the 1998 ACM Hypertext Conference, pages 225234, Pittsburgh, PA, USA, June 1998.
The Anatomy of a LargeScale Hypertextual Web Search Engine.
In Proceedings of the 7th WorldWide Web Conference, pages 107117, Brisbane, Queensland, Australia, April 1998.
Clustering Categorical Data: An Approach Based on Dynamic Systems.
In Proceedings of the 24th International Conference on Very Large Databases, pages 311322, September 1998.
Focus+Context Views of WorldWide Web Nodes.
In Proceedings of the 1997 ACM Hypertext'97 Conference, pages 187196, Southampton, UK, March 1997.
Cut as a Querying Unit for WWW, Netnews, email.
In Proceedings of the 1998 ACM Hypertext Conference, pages 235244, Pittsburgh, PA, USA, June 1998.
Query Relaxation By Structure for Document Retrieval on the Web.
In Proceedings of 1998 Advanced Database Symposium, Shinjuku, Japan, December 1999.
Discovery and Retrieval of Logical Information Units in Web.
In Proceedings of the 1999 ACM Digital Libraries Workshop on Organizing Web Space, Berkeley, CA, USA, August 1999.
Authoritative Sources in a Hyperlinked Environment.
In Proceedings of the 9th ACMSIAM Symposium on Discrete Algorithms, 1998.
Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text.
In Proceedings of the 7th WorldWide Web Conference, pages 6574, Brisbane, Queensland, Australia, April 1998.
Improved Algorithms for Topic Distillation in a Hyperlinked Environment.
In Proceedings of the 21th Annual International ACM SIGIR Conference, pages 104111, Melbourne, Australia, August 1998.
Proximity Search in Databases.
In Proceedings of the 24th International Conference on Very Large Data Bases, pages 2637, New York City, New York, August 1998. VLDB.
DTL's DataSpot: Database Exploration Using Plain Language.
In Proceedings of the 24th International Conference on Very Large Data Bases, pages 645649, New York City, New York, August 1998. VLDB.