www8.dvi
<- Previous | First | Next ->

for Site A. Generally, this happens when the number of sites with strong relationships is greater

than ten, or when sites do not have similar enough content.

3 Implementation

In experimenting with these algorithms, we were fortunate to have access to Compaq's Connectivity

Server [4]. The Connectivity Server provides high speed access to the graph structure of 180 million

URLs (nodes) and the links (edges) that connect them. The entire graph structure is kept in memory

on a Compaq AlphaServer 4100 with 8 GB of main memory and dual 466 MHz Alpha processors.

The random access patterns engendered by the connectivity-based algorithms described in this

paper mean that it is important for most or all of the graph to t in main memory to prevent high

levels of paging activity.

We implemented a multi-threaded server that accepts a URL uses either the Cocitation algorithm or the Companion algorithm to nd pages related to the given URL. Our server implementation consists of approximately 5500 lines of commented C code, of which approximately 1000

lines implement the Companion algorithm, 400 lines implement the Cocitation algorithm, and the

remainder are shared code to perform tasks such as parsing HTTP query requests, printing results,

and logging status messages. We link our server code directly with the Connectivity Server library,

and access the connectivity information by mmapping the graph information into the address space

of our server.

Our implementation of the Companion algorithm has been subjected to a moderate amount

of performance tuning, mostly in designing the neighborhood graph data structures to improve

data-cache performance. The implementation of the Cocitation algorithm has not been tuned

extensively, although it does share a fair amount of code with the Companion algorithm, and this

shared code has been tuned somewhat.

4 Evaluation

In this section we describe the evaluation we performed for the algorithms. Section 4.1 describes

our user study, while Section 4.2 discusses the results of the study. Section 4.3 evaluates the run

time performance of our algorithms.
4.1 Experimental Setup

To compare the di erent approaches, we performed a user study. We asked 18 volunteers to supply

us with at least two URLs for which they wanted to nd related pages. Our volunteers included 14

computer science professionals (mostly our colleagues at Compaq's research laboratories), as well

as 4 people with other professional careers. We received 69 URLs and used each of the algorithms

to determine the top 10 answers for each URL. We put the answers in random order and returned

them to the volunteers for rating. The volunteers were instructed as follows:

We want to measure how well each algorithm performs. To measure performance we want to

know the percentage of valuable URLs returned by each algorithm. To be valuable the URL must

be both relevant to the topic you are interested in and a high quality page. For example, if your

URL was www.audi.com and you get back a newsgroup posting where somebody talks about his

new Audi car, then the page was on topic, but probably not high quality. On the other hand, if

you get www.jaguar.com as an answer, then it is up to you to decide whether this answer is on

topic or not.

Scoring:

7


<- Previous | First | Next ->