
Finding Related Pages in the World Wide Web

Jeffrey Dean
mySimon, Inc.
Santa Clara, CA
jdean@mysimon.com

Monika R. Henzinger
Compaq Systems Research Center
130 Lytton Ave.
Palo Alto, CA 94301
monika@pa.dec.com

Abstract

When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related web pages. A related web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers.

We describe two algorithms to identify related web pages. These algorithms use only the connectivity information in the web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's "What's Related" service [12]. Our study showed that the precision at 10 of our two algorithms is 73% better and 51% better than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.

Keywords: search engines, related pages, searching paradigms.
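
The "precision at 10" figure quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definition (the fraction of the top 10 returned pages that a judge rated as related) and reads "X% better" as a relative improvement over the baseline; the function names and sample data are hypothetical and are not taken from the paper's evaluation.

```python
def precision_at_k(ranked_urls, relevant_urls, k=10):
    """Fraction of the top-k returned URLs that were judged relevant.

    Divides by k even when fewer than k results are returned, so a short
    result list is penalized (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_urls[:k]
    hits = sum(1 for url in top_k if url in relevant_urls)
    return hits / k


def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`; 0.5 means "50% better"."""
    if baseline == 0:
        raise ValueError("baseline precision must be non-zero")
    return (ours - baseline) / baseline


# Hypothetical judgments for a single query URL (not data from the study).
ranked = ["a.example", "b.example", "c.example", "d.example"]
relevant = {"a.example", "c.example"}
p_at_4 = precision_at_k(ranked, relevant, k=4)  # 2 of the top 4 are relevant
gain = relative_improvement(0.75, 0.5)          # 0.75 vs. 0.5 precision
```

Under this reading, an algorithm with precision 0.75 is "50% better" than a baseline with precision 0.5, even though the absolute difference is 25 percentage points.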

1 Introduction

Traditional web search engines take a query as input and produce a set of (hopefully) relevant pages that match the query terms. While useful in many circumstances, search engines have the disadvantage that users have to formulate queries that specify their information need, which is prone to errors. This paper discusses how to find related web pages, a different approach to web searching. In our approach the input to the search process is not a set of query terms, but the URL of a page, and the output is a set of related web pages. A related web page is one that addresses the same topic as the original page, but is not necessarily semantically identical. For example, given www.nytimes.com, the tool should find other newspapers and news organizations on the web. Of course, in contrast to search engines, our approach requires that the user has already found a page of interest.

Recent work in information retrieval on the web has recognized that the hyperlink structure can be very valuable for locating information [18, 3, 7, 23, 19, 25, 24, 6, 17, 5]. This assumes that if there is a link from page v to page w, then the author of v recommends page w, and that links often connect related pages. In this paper, we describe the Companion and Cocitation algorithms, two algorithms that use only the hyperlink structure of the web to identify related web pages. For example, Table 1 shows the output of the Companion algorithm when given www.nytimes.com as

*This work was done while the author was at the Compaq Western Research Laboratory.
