Exploring Link Topology for Associating Web Pages
Quoc Vu and Wen-Syan Li
C&C Research Laboratories,
NEC USA, Inc.
110 Rio Robles M/S SJ100, San Jose, CA 95134, USA
Many systems have been proposed to find the Web document
association, such as the companion and
co-citation algorithms proposed by Dean and
Henzinger and the
used to implement the What's Related? functionalities.
In contrast, we are interested in inducing
the reasons why they are associated.
We focus on the problem "for a given set of pages, find
a list of Web pages which reflect the association of such
two pages." We now use an example to illustrate the problem
statement and overview of our approach.
Link structure connecting and associating the Web pages of
W.Li and D.Agrawal
In Figure 1, we show a subset of links between two personal Web pages W.Li and D.Agrawal (both are highlighted in the figure). The purpose of each link is indicated as its label. For example, W.Li graduated from Northwestern University; whereas D.Agrawal and P.Scheuermann both graduated from SUNY. The associations between W.Li and D.Agrawal are implicitly expressed in the link structure connecting W.Li and D.Agrawal by many Web authors associated with W.Li and D.Agrawal. Below, we enumerate some associations that can be derived from this graph, and some possible interpretations for such associations:
- Association 1: Web8 paper page appears in a path of distance of 2 connecting the pages W.Li and D.Agrawal. Therefore, W.Li and D.Agrawal may be associated due to a co-authored paper.
- Association 2: Y.Wu page is on two paths related to NEC Research Laboratories, each of distance 4. W.Li and D.Agrawal may be associated due to the fact they both supervised Y.Wu at different occasions or they participate in the same project at NEC.
Obviously, the links connecting these pages are not intended to express such associations and the authors were not coordinated to make the links and anchors semantics consistent across the Web. However, we can assume and use the following two intuitions:
- Path length factor: Pages on a shorter path between W.Li and D.Agrawal are stronger indicators to reflect why the W.Li and D.Agrawal pages are associated than those on a longer path.
- Connectivity factor: Pages which appear on more paths should be stronger indicators than others to reflect why the W.Li and D.Agrawal pages are associated.
Note that a page with a higher connectivity (i.e. more incoming links and outgoing links) is more likely to be included in more paths; consequently, such a page is more likely to be ranked higher according to the above criteria. This is consistent with the principle of topic distillation[3, 4].
Links have been used in many fields to associate documents. In this paper, we consider three type of relationships:
- co-citation: two documents are relevant to each other if they are linked by a common document.
- social filtering: two documents are relevant to each other if they link to a common document.
- transitivity: two documents are relevant to each other if there is a path from one to another.
The scoring functions are used to measure how well a page in inducing the association of the given pages. The first scoring function is used to measure the significance of a node in both the connectivity factor and the distance factor.
To derive metrics for nodes selection, one straight forward
candidate metric, that adjusts connectivity scores by distance
where paths(A,B,v) is the set of paths between the given pages, A and B, that pass through a candidate v, and length(p) is the length of the path p.
We are also interested in nodes appearing in many short
paths but not many long paths.
Based on the previous scoring function, this type of nodes
may not receive high scores. However, such exceptions
are also interesting since such nodes are located in a more
isolated sub-graph. Thus we design another scoring function
for exceptions as follows:
In contrast to other work focusing on finding related pages for a given page, we deal with the problem "for a given pair of pages, find a list of Web pages which reflect the association of such two pages." We design two scoring functions for measuring different aspects of page associations. We have developed and implemented the exploration algorithms and experimentally evaluate the algorithms on real Web data. Future work includes conducting experiments on real Web data to evaluate the effectiveness of content focused algorithm. We are also interested in applying this work to workflow management to identify the bottle neck and critical points in the workflow.
- Jeffrey Dean and Monika Henzinger, Finding Related Pages in the World Wide Web, In Proceedings of the 8th World-Wide Web Conference, Toronto, Canada, May 1999.
- Netscape Communications Corporation, What's Related web page, Information available at http://home.netscape.com/netscapes/related/faq.html.
- Jon M. Kleinberg, Authoritative sources in a hyperlinked environment, In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, January 1998.
- Wen-Syan Li and Selcuk Candan, Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management, To appear in ACM Computing Survey, 2000.