WWW10, May 15, 2001, Hong Kong. 2001 1581133480/01/0005
Finding Authorities and Hubs From Link Structures on the World Wide Web
Allan Borodin, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4, and Gammasite, Hertzilya, Israel. Email: bor@cs.toronto.edu Web: http://www.cs.utoronto.ca/DCS/People/Faculty/bor.html
Gareth O. Roberts, Department of Mathematics and Statistics, Lancaster University, Lancaster, U.K. LA1 4YF. Email: g.o.roberts@lancaster.ac.uk Web: http://www.maths.lancs.ac.uk/dept/people/robertsg.html
Jeffrey S. Rosenthal, Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 3G3. Supported in part by NSERC of Canada. Email: jeff@math.toronto.edu Web: http://markov.utstat.toronto.edu/jeff
Panayiotis Tsaparas, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4. Email: tsap@cs.toronto.edu Web: http://www.cs.toronto.edu/~tsap
Abstract
Recently, there have been a number of algorithms proposed for analyzing hypertext link structure so as to determine the best "authorities" for a given topic or query. While such analysis is usually combined with content analysis, there is a sense in which some algorithms are deemed to be "more balanced" and others "more focused". We undertake a comparative study of hypertext link analysis algorithms. Guided by some experimental queries, we propose some formal criteria for evaluating and comparing link analysis algorithms.
Keywords: link analysis, web searching, hubs, authorities, SALSA, Kleinberg's algorithm, threshold, Bayesian
Introduction
In recent years, a number of papers [3,8,11,9,4] have considered the use of hypertext links to determine the value of different web pages. In particular, these papers consider the extent to which hypertext links between World Wide Web documents can be used to determine the relative authority values of these documents for various search queries.
We consider some of the previously published algorithms as well as introducing some new alternatives. One of our new algorithms is based on a Bayesian statistical approach as opposed to the more common algebraic/graph theoretic approach. While link analysis by itself cannot be expected to always provide reliable rankings, it is interesting to study various link analysis strategies in an attempt to understand inherent limitations, basic properties and "similarities" between the various methods. To this end, we offer definitions for several intuitive concepts relating to (link analysis) ranking algorithms and begin a study of these concepts.
We also provide some new (comparative) experimental studies of the performance of the various ranking algorithms. It can be seen that no method is completely safe from "topic drift", but some methods do seem to be more resistant than others. We shall see that certain methods have surprisingly similar rankings as observed in our experimental studies, however they cannot be said to be similar with regard to our formalization.
Previous Algorithms
The PageRank Algorithm
One of the earliest and most commercially successful of the efforts to use hypertext link structures in web searching is the PageRank algorithm used by Brin and Page [3] in the Google search engine [7].
The page rank of a given web page , denoted , is defined recursively according to the equation
where the sum is taken over all pages which have a link to page , is the total number of links originating from page , is a number between 0 and 1, and is a probability distribution (e.g. uniform) over all web pages.
Brin and Page [3] note that the value of is equivalent to the limiting fraction of time spent on page by a random walk which proceeds at each step as follows: With probability it jumps to a sample from the distribution , and with probability it jumps uniformly at random to one of the pages linked from the current page. This idea is also used by Rafiei and Mendelzon [11] for computing the "reputation" of a page. Intuitively, the value of is a measure of the importance or authority of the web page . This ranking is used as one component of the Google search engine, to help determine how to order the pages returned by a web search query.
Kleinberg's Algorithm
Independent of Brin and Page, Kleinberg [8] proposed a more refined notion for the importance of web pages. He suggested that web page importance should depend on the search query being performed. Furthermore, each page should have a separate "authority" rating (based on the links going to the page) and "hub" rating (based on the links going from the page). Kleinberg proposed first using a textbased web search engine (such as AltaVista [1]) to get a "Root Set" consisting of a short list of web pages relevant to a given query. Second, the Root Set is augmented by pages which link to pages in the Root Set, and also pages which are linked to pages in the Root Set, to obtain a larger `Base Set" of web pages. If is the number of pages in the final Base Set, then the data for Kleinberg's algorithm consists of an adjacency matrix , where if there are one or more hypertext links from page to page , otherwise .
Kleinberg's algorithm assigns to each page an authority weight and a hub weight . Let denote the vector of all authority weights, and the vector of all hub weights. Initially both authority and hub vectors are set to . At each iteration the operations ("in") and ("out") are performed. The operation sets the authority vector to . The operation sets the hub vector to . A normalization step is then applied, so that the vectors and become unit vectors in some norm. Kleinberg proves that after a sufficient number of iterations the vectors and converge to the principal eigenvectors of the matrices and , respectively. The above normalization step may be performed in various ways. Indeed, ratios such as will converge to the same value no matter how (or if) normalization is performed.
Kleinberg's Algorithm (and some of the other algorithms we are considering) converge naturally to their principal eigenvector, i.e. to the eigenvector of their transition matrix which corresponds to the largest eigenvalue. Kleinberg [8] makes an interesting (though nonprecise) claim that the subsequent nonprincipal eigenvectors (or their positive and negative components) are sometimes representative of "subcommunities" of web pages. It is easy to construct simple examples which show that subsequent eigenvectors sometimes are, but sometimes are not, indicative of subcommunities; we present a few indicative such examples in the full version of this paper. The significance of nonprincipal eigenvectors is an important topic that we intend to pursue further.
The SALSA Algorithm
An alternative algorithm, SALSA, was proposed by Lempel and Moran [9]. Like Kleinberg's algorithm, SALSA starts with a similarly constructed Base Set. It then performs a random walk by alternately (a) going uniformly to one of the pages which links to the current page, and (b) going uniformly to one of the pages linked to by the current page. The authority weights are defined to be the stationary distribution of the twostep chain doing first step (a) and then (b), while the hub weights are defined to be the stationary distribution of the twostep chain doing first step (b) and then (a).
Formally, let denote the set of all nodes that point to , that is, the nodes we can reach from by following a link backwards, and let denote the set of all nodes that we can reach from by following a forward link. The Markov Chain for the authorities has transition probabilities
Assume for a moment that the Markov Chain is irreducible, that is, the underlying graph consists of a single connected component. The authors prove that the stationary distribution of the Markov Chain satisfies , where is the set of all (backward) links. Similarly, the Markov Chain for the hubs has transition probabilities
and the stationary distribution satisfies , where is the set of all (forward) links.
SALSA does not really have the same "mutually reinforcing structure" that Kleinberg's algorithm does. Indeed, since , the relative authority of site within a connected component is determined from local links, not from the structure of the component. (See also the discussion of locality in Section 7.) We also note that in the special case of a single component, SALSA can be viewed as a onestep truncated version of Kleinberg's algorithm. That is, in the first iteration of Kleinberg's algorithm, if we perform the operation first, the authority weights are set to , where is the vector of all ones. If we normalize in the norm, then , which is the stationary distribution of the SALSA algorithm. A similar observation can be made for the hub weights.
If the underlying graph of the Base Set consists of more than one component, then the SALSA algorithm selects a starting point uniformly at random, and performs a random walk within the connected component that contains that node. Formally, let be a component that contains node , let denote the number of nodes in the component, and the set of (backward) links in component . Then, the authority weight of node is
Led astray by the simplifying assumption of a single component, we considered a simplified version of the SALSA algorithm where the authority weight of a node is the ratio . This corresponds to the case that the starting point for the random walk is chosen with probability proportional to the "popularity" of the node, that is, the number of links that point to this node. We will refer to this variation of the SALSA algorithm as pSALSA (popularity SALSA). We will consider the original SALSA algorithm as defined in [9] in the full version of our paper.
An interesting generalization of the SALSA algorithm is considered by Rafiei and Mendelzon [11]. They propose an algorithm for computing reputations that is a hybrid of the SALSA algorithm, and the PageRank algorithm. At each step, with probability , the Rafiei and Mendelzon algorithm jumps to a page of the collection chosen uniformly at random, and with probability it performs a SALSA step.
The PHITS Algorithm
Cohn and Chang [4] propose a statistical hubs and authorities algorithm, which they call the PHITS Algorithm. They propose a probabilistic model in which a citation of a document is caused by a latent "factor" or "topic", . It is postulated that there are conditional distributions of a citation given a factor , and also conditional distributions of a factor given a document . In terms of these conditional distributions, they produce a likelihood function.
Cohn and Chang then propose using the EM Algorithm of Dempster et al. [5] to assign the unknown conditional probabilities so as to maximize this likelihood function , and thus best "explain" the proposed data. Their algorithm requires specifying in advance the number of factors to be considered. Furthermore, it is possible that their EM Algorithm could get "stuck" in a local maximum, without converging to the true global maximum.
Random Walks and the Kleinberg Algorithm
The fact that the output of the first (half) step of the Kleinberg algorithm can be seen as the stationary distribution of a certain random walk on the underlying graph, poses the natural question of whether other intermediary results of Kleinberg's algorithm (and ultimately the output of the algorithm itself) can also be seen as the stationary distribution of a random walk. We show that this is indeed the case.
Some modifications to the Kleinberg and SALSA Algorithms
While Kleinberg's algorithm has some very desirable properties, it also has its limitations. One potential problem is the possibility of severe "topic drift". Roughly, Kleinberg's algorithm converges to the most "tightlyknit" community within the Base Set. It is possible that this tightlyknit community will have little or nothing to do with the proposed query topic.
A striking example of this phenomenon is provided by Cohn and Chang ([4], p. 6). They use Kleinberg's Algorithm with the search term "jaguar" (an example query suggested by Kleinberg [8]), and converge to a collection of sites about the city of Cincinnati! They determine that the cause of this is a large number of online newspaper articles in the Cincinnati Enquirer which discuss the Jacksonville Jaguars football team, and all link to the same standard Cincinnati Enquirer service pages. Interestingly, in a preliminary experiment with the query term "abortion" (another example query suggested by Kleinberg [8]), we also found the Kleinberg Algorithm converging to a collection of web pages about the city of Cincinnati!
Now, in both these cases, we believe it is possible to eliminate such errant behavior through more careful selection of the Base Set, and more careful elimination of intradomain hypertext links. Nevertheless, we do feel that these examples point to a certain "instability" of Kleinberg's Algorithm.
The HubAveragingKleinberg Algorithm
We propose here a small modification of Kleinberg's algorithm to help remedy the abovementioned instability. For motivation, consider the following. Suppose there are authority pages, and hub pages, with large. The first hubs link only to the first authority, while the final hub links to all authorities. In such a setup, we would expect the first authority to be considered much more authoritative than all the others, and Kleinberg's algorithm does indeed do this. On the other hand, it seems that the final hub should be worse than the others, since in addition to linking to a good authority (the first authority), it also links to many bad authorities. However, according to Kleinberg's algorithm, it is the best hub, 5pt because linking to more things can only improve your hub rating.
Inspired by such considerations, we propose an algorithm which is a "hybrid" of the Kleinberg and SALSA algorithms. Namely, it does the authority rating updates just like Kleinberg (i.e., giving each authority a rating equal to the sum of the hub ratings of all the pages that link to it), but does the hub rating updates by instead giving each hub a rating equal to the average of the authority ratings of all the pages that it links to. With this modified "HubAveraging" algorithm, a hub is better if it links to only good authorities, rather than linking to both good and bad authorities.
The ThresholdKleinberg Algorithms
We propose two different "threshold" modifications to Kleinberg's Algorithm. The first modification, HubThreshold, is applied to the instep . When computing the authority weight of the page, this algorithm does not take into account all hubs that point to page . It only counts those hubs whose hub weight is at least the average hub weight over all the hubs that point to page , computed using the current hub weights for the nodes. This corresponds to saying that a site should not be considered a good authority simply because a lot of very poor hubs point to it.
The second modification, AuthorityThreshold, is applied to the outstep . When computing the hub weight of the page, this algorithm does not take into account all authorities pointed to by page . It only counts those authorities which are among the top authorities, judging by current authority values. The value of is passed as a parameter to the algorithm. This corresponds to saying that a site should not be considered a good hub simply because it points to a number of "acceptable" authorities; rather, to be considered a good hub the site must point to some of the best authorities. This is inspired partially by the fact that, in most web searches, a user only visits the top few authorities.
We also consider a FullThreshold algorithm, which makes both the HubThreshold and AuthorityThreshold modifications to Kleinberg's Algorithm.
The BreadthFirstSearch Algorithm: A Normalized step Variant
When the pSALSA algorithm computes the authority weight of a page, it takes into account only the popularity of this page within its immediate neighborhood, disregarding the rest of the graph. On the other hand, the Kleinberg algorithm considers the whole graph, taking into account more the structure of the graph around the node, than the popularity of that node in the graph. Specifically, after steps, the authority weight of the authority is , where is the number of paths that leave node . Another way to think of this is that the contribution of a node to the weight of is equal to the number of paths that go from to . Therefore, if a small bipartite component intercepts the path between node and , the contribution of node will increase exponentially fast. This may not always be desirable, especially if the bipartite component is not representative of the query.
We propose the BreadthFirstSearch (BFS) algorithm, as a generalization of the pSALSA algorithm, and a restriction of the Kleinberg algorithm. The BFS algorithm extends the idea of popularity that appears in pSALSA from a one link neighborhood to an link neighborhood. However, instead of considering the number of paths that leave , it considers the number of neighbors of node . We let denote the set of nodes that can be reached from by following a path. The contribution of node to the weight of node depends on the distance of the node from . We adopt an exponentially decreasing weighting scheme. Therefore, the weight of node is determined as follows:
The algorithm starts from node , and visits its neighbors in BFS order. At each iteration it takes a Backward or a Forward step (depending on whether it is an odd, or an even iteration), and it includes the new nodes it encounters. The weight factors are updated accordingly. Note that each node is considered only once, when it is first encountered by the algorithm.
A Bayesian Algorithm
A different type of algorithm is given by a fully Bayesian statistical approach to authorities and hubs. Suppose there are hubs and authorities (which could be the same set). We suppose that each hub has an (unknown) real parameter , corresponding to its "general tendency to have hypertext links", and also an (unknown) nonnegative parameter , corresponding to its "tendency to have intelligent hypertext links to authoritative sites". We further suppose that each authority has an (unknown) nonnegative parameter , corresponding to its level of authority.
Our statistical model is as follows. The a priori probability of a link from hub to authority is given by
with the probability of no link from to given by
This reflects the idea that a link is more likely if is large (in which case hub has large tendency to link to any site), or if both and are large (in which case is an intelligent hub, and is a highquality authority).
To complete the specification of the statistical model from a Bayesian point of view (see, e.g., Bernardo and Smith [2]), we must assign prior distributions to the unknown parameters , , and . (These priors should be general and uninformative, and should not depend on the observed data. For large graphs, the choice of priors should have only a small impact on the results.) To do this, we let and be fixed parameters, and let each have prior distribution , a normal distribution with mean and variance . We further let each and have prior distribution (since they have to be nonnegative), meaning that for , .
The (standard) Bayesian inference method then proceeds from this fullyspecified statistical model, by conditioning on the observed data, which in this case is the matrix of actual observed hypertext links in the Base Set. Specifically, when we condition on the data we obtain a posterior density for the parameters . This density is defined so that
for any (measurable) subset , and also
for any (measurable) function . An easy computation gives the following.
Our Bayesian algorithm then reports the conditional means of the parameters, according to the posterior density . That is, it reports final values , , and , where, for example
To actually compute these conditional means is nontrivial. To accomplish this, we used a Metropolis Algorithm. (The Metropolis algorithm is an example of a Markov chain Monte Carlo Algorithm; for background see, e.g., Smith and Roberts [13]; Tierney [14]; Gilks et al. [6]; Roberts and Rosenthal [12]).
There is, of course, some arbitrariness in the specification of this Bayesian algorithm, e.g., in the form of the prior distributions and in the precise formula for the probability of a link from to . However, the model appears to work well in practice, as our experiments show. We note that it is possible that the priors for a new search query could instead depend on the performance of hub on different previous searches, though we do not pursue that here.
This Bayesian algorithm is similar in spirit to the PHITS algorithm of Cohn and Chang [4] described earlier, in that both use statistical modeling, and both use an iterative algorithm to converge to an answer. However, the algorithms differ substantially in their details. Firstly, they use substantially different statistical models. Secondly, the PHITS algorithm uses a nonBayesian (i.e. "classical" or "frequentist") statistical framework, as opposed to the Bayesian framework adopted here.
A Simplified Bayesian Algorithm
It is possible to simplify the above Bayesian model, by replacing equation (2) with and correspondingly replace equation (3) with . This eliminates the parameters entirely, so that we no longer need the prior values and .
This leads to a slightly modified posterior density , now given by where
This Simplified Bayesian algorithm was designed to be to similar to the original Bayesian algorithm. Surprisingly, we will see that experimentally it often performs very similarly to the pSALSA algorithm.
Experimental Results
We have implemented the algorithms presented here on various queries. Because of space limitations we only report here (see Appendix A) a representative subset of results; all of our results (including the queries "death penalty", "computational complexity" and "gun control" which are not reported here) can be obtained at
http://www.cs.toronto.edu/tsap/experiments. The reader may find it easier to follow the discussion in the next section by accessing the full set of results. For the generation of the Base Set of pages, we follow the specifications of Kleinberg [8] described earlier. For each of the queries, we begin by generating a Root Set that consists of the first 200 pages returned by AltaVista on the same query. The Root Set is then expanded to the Base Set by including nodes that point to, or are pointed to, by the pages in the Root Set. In order to keep the size of the Base Set manageable, for every page in the Root Set, we only include the first 50 pages returned from AltaVista that point to this page. We then construct the graph induced by nodes in the Base Set, by discovering all links among the pages in the Base Set, eliminating those that are between pages of the same domain.
For each query, we tested nine different algorithms on the same Base Set. We present the top ten authority sites returned by each of the algorithms. For evaluation purposes, we also include a list of the URL and title (possibly abbreviated) of each site which appears in the top five of one or more of the algorithms. For each page we also note the popularity of the page (denoted pop in the tables), that is, the number of different algorithms that rank it in the top ten sites. The pages that seem (to us) to be generally unrelated with the topic in hand appear boldfaced. We also present an "intersection table" which provides, for each pair of algorithms, the number of sites which were in the top ten according to both algorithms (maximum 10, minimum 0).
In the tables, SBayesian denotes the Simplified Bayesian algorithm, HubAvg denotes the HubAveraging Kleinberg algorithm, AThresh denotes the AuthorityThreshold algorithm, HThresh denotes the HubThreshold algorithm, and FThresh denotes the FullThreshold algorithm. For the
AuthorityThreshold and FullThreshold algorithms, we (arbitrarily) set the threshold .
Discussion of Experimental Results
We observe from the experiments that different algorithms emerge as the "best" for different queries, while there are queries for which no algorithm seems to perform well. One prominent such case is the query on "net censorship" (also on "computational complexity") where only a few of the top ten pages returned by any of the algorithms can possibly be considered as authoritative on the subject. One possible explanation is that in these cases the topic is not well represented on the web, or there is no strong interconnected community. This reinforces a common belief that any commercial search engine cannot rely solely on link information, but rather must also examine the text content of sites to prevent such difficulties as "topic drift". On the other hand, in cases such as "death penalty" (not shown here), all algorithms converge to almost the same top ten pages, which are both relevant and authoritative. In these cases the community is well represented, and strongly interconnected.
The experiments also indicate the difference between the behavior of the Kleinberg algorithm and pSALSA, first observed for the SALSA algorithm in the original paper of Lempel and Moran [9]. Specifically, when computing the top authorities, the Kleinberg algorithm tends to concentrate on a "tightly knit community" of nodes (the TKC effect), while pSALSA, like SALSA, tends to mix the authorities of different communities in the top authorities. The TKC effect becomes clear in the "genetic" query, where the Kleinberg algorithm only reports pages on biology in the top ten while pSALSA mixes these pages with pages on genetic algorithms. It also becomes poignantly clear in the "movies" query (and also in the "gun control" and the "abortion" query), where the top ten pages reported by the Kleinberg algorithm are dominated by an irrelevant cluster of nodes from the about.com community. A more elaborate algorithm for detecting intradomain links could help alleviate this problem. However, these examples seem indicative of the topic drift potential of the principal eigenvector in the Kleinberg algorithm.
On the other hand, the limitations of the pSALSA algorithm become obvious in the "computational geometry" query, where three out of the top ten pages belong to the unrelated w3.com community. They appear in the top positions because they are pointed to by a large collection of pages by ACM, which point to nothing else. A similar phenomenon explains the appearance of the "Yahoo!" page in the "genetic" query. We thus see that the simple heuristic of counting the indegree as the authority weight is also imperfect.
We identify two types of characteristic behavior: the Kleinberg behavior, and the pSALSA behavior. The former ranks the authorities based on the structure of the entire graph, and tends to favor the authorities of tightly knit communities. The latter ranks the authorities based on their popularity in their immediate neighborhood, and favors various authorities from different communities. To see how the rest of the algorithms fit within these two types of behaviors, we compare the behavior of algorithms on a pairwise basis, using the number of intersections in their respective top ten authorities as an indication of agreement.
The first striking observation is that the Simplified Bayesian algorithm is almost identical to the pSALSA algorithm. The pSALSA algorithm and the Simplified Bayesian have at least overlap on all queries. One possible explanation for this is that both algorithms place great importance on the indegree of a node when determining the authority weight of a node. For the pSALSA algorithm we know that it is "local" in nature, that is, the authority weight assigned to a node depends only on the links that point to this node, and not on the structure of the whole graph. The Simplified Bayesian seems to possess a similar, yet weaker property; we explore the locality issue further in the next section. On the other hand, the Bayesian algorithm appears to resemble both the Kleinberg and the pSALSA behavior, leaning more towards the first. Indeed, although the Bayesian algorithm avoids the severe topic drift in the "movies" and the "gun control" queries (but not in the "abortion" case), it usually has higher intersection numbers with Kleinberg than with pSALSA. One possible explanation for this observation is the presence of the parameters in the Bayesian algorithm (but not the Simplified Bayesian algorithm), which "absorb" some of the effect of many links pointing to a node, thus causing the authority weight of a node to be less dependent on its indegree.
Another algorithm that seems to combine characteristics of both the pSALSA and the Kleinberg behavior is the HubAveraging algorithm. The HubAveraging algorithm is by construction a hybrid of the two since it alternates between one step of each algorithm. It shares certain behavior characteristics with the Kleinberg algorithm: if we consider a full bipartite graph, then the weights of the authorities increase exponentially fast for HubAveraging (the rate of increase, however, is the square root of that of the Kleinberg algorithm). However, if the component becomes infiltrated, by making one of the hubs point to a node outside the component, then the weights of the authorities in the component drop. This prevents the HubAveraging algorithm from completely following the drifting behavior of the Kleinberg algorithm in the "movies" query. Nevertheless, in the "genetic" query, HubAveraging agrees strongly with Kleinberg, focusing on sites of a single community, instead of mixing communities as does pSALSA. On the other hand, HubAveraging and pSALSA share a common characteristic, since the HubAveraging algorithm tends to favor nodes with high indegree: if we consider an isolated component of one authority with high indegree, the authority weight of this node will increase exponentially fast. This explains the fact that the top three authorities for "computational geometry" are the w3.com pages that are also ranked highly by pSALSA (with HubAveraging giving a very high weight to all three authorities).
For the threshold algorithms, since they are modifications of the Kleinberg Algorithm, they are usually closer to the Kleinberg behavior. This is especially true for the HubThreshold algorithm. However, the benefit of eliminating unimportant hubs when computing authorities becomes obvious in the "abortion" query. If one looks further than the first ten pages returned by Kleinberg, one observes that after the first nine pages, which belong to the amazon.com community, the rest of the pages are on topic. The HubThreshold algorithm escapes this cluster, and moves directly to the relevant pages. The intersection between the top ten pages of HubThreshold, and the set of pages in the positions 10 to 20 in the Kleinberg algorithm is .
The AuthorityThreshold often appears to be most similar with the HubAveraging algorithm. This makes sense since these two algorithms have a similar underlying motivation. The best moment for AuthorityThreshold is the "movies" query, where it reports the most relevant top ten pages among all algorithms. The FullThreshold algorithm combines elements of both the Threshold algorithms; however, usually it reports in the top ten a mixture of the results of the two algorithms, rather than the best of the two.
Finally, the BFS algorithm is designed to be a generalization of the pSALSA algorithm, that combines some elements of the Kleinberg algorithm. Its behavior resembles both pSALSA and Kleinberg, with a tendency to favor pSALSA. In the "genetic" and "abortion" queries it demonstrates some mixing, but to a lesser extent than that of pSALSA. The most successful moments for BFS are the "abortion" and the "gun control" queries where it reports a set of top ten pages that are all on topic. An interesting question to investigate is how the behavior of the BFS algorithm is altered if we change the weighting scheme of the neighbors.
Theoretical Analysis
The experimental results of the previous section suggest that certain algorithms seem to share similar properties and ranking behavior. In this section, we initiate a (preliminary) formal study of fundamental properties and comparisons between ranking algorithms. For the purpose of following analysis we need some basic definitions and notation. Let be a collection of graphs of size . One special case is to let be the set of all graphs of size , hereafter denoted . We define a link analysis algorithm as a function that maps a graph to an dimensional vector. We call the vector the weight vector of algorithm on graph . The value of the entry of vector denotes the authority weight assigned by the algorithm to the page .
We can normalize the weight vector under some chosen norm. The choice of normalization affects the definition of some of the properties of the algorithms, so we discriminate between algorithms that use different norms. For any norm , we define the algorithm to be the algorithm , where the weight vector of is normalized under . We also examine unnormalized algorithms, where no normalization is performed at any stage of the algorithm. For example, the unnormalized pSALSA algorithm assigns a weight to page equal to the indegree of page . For the following discussion, when not stated explicitly, we will assume that the weight vectors of the algorithms are normalized under the norm (i.e. each weight is divided by the maximum weight); this gives weight 1 to the top authority, with other weights given as a fraction of the top weight. Due to space constraints, many proofs have been omitted in the following sections.
Monotonicity
Monotonicity appears to be a "reasonable" property that any sensible linkanalysis algorithm should satisfy.
Similarity
Let and be two algorithms on . We shall consider the distance between the weight vectors of and , for , where is some function that maps the weight vectors and to a real number . We shall consider the Manhattan distance , that is, the norm of the difference of the weight vectors, given by .
For this distance function, we now define the similarity between two algorithms as follows.
We also consider another distance function that attempts to capture the similarity between the ordinal rankings of two algorithms. The motivation behind this definition is that the ordinal ranking is the usual endproduct seen by the user. Let and be the weight vectors of two algorithms and . We define the indicator function as follows
We note that if and only if . becomes one for each pair of nodes that are ranked differently. We define the "ranking distance" function as follows.
Note that, unlike , the distance between two weight vectors under does not depend upon the choice of normalization.
On the other hand, we have the following.
Let be the set of all size nested graphs. (Of course, is a rather restricted set of size graphs.)
Stability and Locality
In the previous section we examined the similarity of two different algorithms on the same graph . In this section we are interested on how the output of a fixed algorithm changes, as we alter the graph. We would like small changes in the graph to have a small effect on the weight vector of the algorithm. We capture this requirement by the definition of stability. For the following, let denote the set of all edges (i.e. links) in the graph . We assume that , otherwise all properties that we discuss below are trivial. The following definition applies to unnormalized, and algorithms.
Stability may be a desirable property. Indeed, the algorithms all act on a base set which is generated using some other search engine (e.g. AltaVista [1]) and the associated hypertext links. Presumably with a "very good" base set, all the algorithms would perform well. However, if an algorithm is not stable, then slight changes in the base set (or its link structure) may lead to large changes in the rankings given by the algorithm. Thus, stability may provide "protection" from poor base sets. We note that the parameter used in the definition of stability allows for an arbitrary scaling of the second weight vector, thus eliminating instability which is caused solely by different normalization factors.
Averaging algorithms are neither stable, nor rank stable.
We now introduce the idea of "locality". The idea behind locality is that a change in the inlinks of a node should have only a small effect on the weights of the rest of the nodes.
We note that locality depends on the normalization used, but pairwise locality and rank locality do not. The following lemmas are direct consequences of the definitions.
We have the following.
We originally thought that the Bayesian and Simplified Bayesian Algorithms were also local. However, it turns out that they are neither local nor pairwise local. Indeed, it is true that conditional on the values of , , and , the conditional distribution of for is unchanged upon removing a link from to . However, the unconditional marginal distribution of , and hence also its posterior mean (or even ratios for ), may still be changed upon removing a link from to . (Indeed, we have computed experimentally that may change upon removing a link from 1 to 2, even for a simple example with just four nodes.) Hence, neither the Bayesian nor the Simplified Bayesian Algorithm is local or pairwise local.
Finally, we use locality and "labelindependence" to prove a uniqueness property of the pSALSA algorithm.
All of our algorithms are clearly labelindependent, and we would expect this property of any reasonable algorithm. We have the following.
Symmetry
We have by inspection:
Summary
We have considered a number of known and some new algorithms which use the hypertext link structure of World Wide Web pages to extract information about the relative ranking of these pages. In particular, we have introduced two algorithms based on Bayesian statistical approach as well as a number of algorithms which are modifications of Kleinberg's seminal hubs and authority algorithm. Based on 8 different queries (5 presented here), we discuss some observed properties of each algorithm as well as relationships between the algorithms. We found (experimentally) that certain algorithms appear to be more "balanced", while others more "focused". The latter tend to be sensitive to the existence of tightly interconnected clusters, which may cause them to drift. The intersections between the lists of the topten results of the algorithms suggest that certain algorithms exhibit similar behavior and properties.
Motivated by the experimental observations, we introduced a theoretical framework for the study and comparison of linkanalysis ranking algorithms. We formally defined (and gave some preliminary results for) the concepts of monotonicity and locality, as well as various concepts of distance and similarity between ranking algorithms.
Our work leaves a number of interesting open questions. The two Bayesian algorithms open the door to the use of other statistical and machine learning techniques for ranking of hyperlinked documents. Furthermore, the framework we defined suggests a number of interesting directions for the theoretical study of ranking algorithms, which we have just begun to explore in this work.
Bibliography
 1
 AltaVista Company.
AltaVista search engine.
http://www.altavista.com.  2
 J.M. Bernardo and A.F.M. Smith.
Bayesian Theory.
John Wiley & Sons, Chichester, England, 1994.  3
 S. Brin and L. Page.
The anatomy of a largescale hypertextual Web search engine.
In 7th International World Wide Web Conference, Brisbane, Australia, April 1998.  4
 D. Cohn and H. Chang.
Learning to probabilistically identify authoritative documents.
Preprint, 2000.  5
 A. Dempster, N. Laird, and D. Rubin.
Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39:138, 1977.  6
 W.R. Gilks, S. Richardson, and D.J. Spiegelhalter.
Markov Chain Monte Carlo in practice.
Chapman and Hall, London, 1996.  7
 Google.
Google search engine.
http://www.google.com.  8
 J. Kleinberg.
Authoritative sources in a hyperlinked environment.
Journal of ACM (JASM), 46, 1999.  9
 R. Lempel and S. Moran.
The stochastic approach for linkstructure analysis (SALSA) and the TKC effect.
In 9th International World Wide Web Conference, Amsterdam, Netherlands, May 2000.  10
 R. Motwani and P. Raghavan.
Randomized Algorithms.
Cambridge University Press, Cambridge, England, June 1995.  11
 D. Rafiei and A. Mendelzon.
What is this page known for? Computing web page reputations.
In 9th International World Wide Web Conference, Amsterdam, Netherlands, May 2000.  12
 G.O. Roberts and J.S. Rosenthal.
Markov chain Monte Carlo: Some practical implications of theoretical results (with discussion).
Canadian Journal of Statistics, 26:531, 1998.  13
 A.F.M. Smith and G.O. Roberts.
Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion).
Journal of the Royal Statistical Society, Series B, 55:324, 1993.  14
 L. Tierney.
Markov chains for exploring posterior distributions (with discussion).
Annals of Statistics, 22:17011762, 1994.
Experiments
Query: abortion (Base Set size = 2293)

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
1. 
P1165 
P717 
P1165 
P1165 
P717 
P1165 
P717 
P717 
P717 
2. 
P1184 
P1461 
P1184 
P1184 
P1461 
P1184 
P719 
P1461 
P1192 
3. 
P1193 
P719 
P1193 
P1193 
P1769 
P1193 
P1769 
P1191 
P1165 
4. 
P1187 
P1165 
P1187 
P1187 
P719 
P1461 
P1461 
P719 
P1191 
5. 
P1188 
P1184 
P1188 
P1188 
P1 
P719 
P962 
P1184 
P1193 
6. 
P1189 
P1193 
P1189 
P1189 
P718 
P717 
P0 
P1165 
P1184 
7. 
P1190 
P1187 
P1190 
P1190 
P0 
P1 
P2 
P1192 
P1189 
8. 
P1191 
P1188 
P1191 
P1191 
P115 
P0 
P718 
P1193 
P1188 
9. 
P1192 
P1189 
P1192 
P1192 
P2515 
P115 
P1325 
P1188 
P1187 
10. 
P717 
P1190 
P1948 
P1948 
P962 
P607 
P1522 
P1187 
P1190 
Index 
pop 
URL 
Title 
P0 
3 
www.gynpages.com 
Abortion Clinics OnLine 
P1 
2 
www.prochoice.org 
NAF  The Voice of Abortion Providers 
P2 
1 
www.cais.com/agm/main 
The Abortion Rights Activist Home Page 
P115 
2 
www.ms4c.org 
Medical Students for Choice 
P607 
1 
www.feministcampus.org 
Feminist Campus Activism Online: Welcome 
P717 
7 
www.nrlc.org 
National Right to Life Organization 
P718 
2 
www.hli.org 
Human Life International (HLI) 
P719 
5 
www.naral.org 
NARAL: Abortion and Reproductive Rights: ... 
P962 
2 
www.prolife.org/ultimate 
Empty title field 
P1165 
7 
www5.dimeclicks.com 
DimeClicks.com  Web and Marketing Solutions 
P1184 
7 
www.amazon.com/...../youdebatecom 
Amazon.comEarth's Biggest Selection 
P1187 
6 
www.amazon.com/...../topsellers.html 
Amazon.comEarth's Biggest Selection 
P1188 
6 
www.amazon.com/.../software/home.html 
Amazon.com Software 
P1189 
5 
www.amazon.com/.../hot100music.html 
Amazon.comEarth's Biggest Selection 
P1190 
5 
www.amazon.com/.../gifts.html 
Amazon.comEarth's Biggest Selection 
P1191 
5 
www.amazon.com/.....top100dvd.html 
Amazon.comEarth's Biggest Selection 
P1192 
5 
www.amazon.com/...top100video.html 
Amazon.comEarth's Biggest Selection 
P1193 
7 
rd1.hitbox.com/....... 
HitBox.com  hitbox web site ....... 
P1325 
1 
www.serve.com/fem4life 
Feminists For Life of America 
P1461 
5 
www.plannedparenthood.org 
Planned Parenthood Federation of America 
P1522 
1 
www.naralny.org 
NARAL/NY 
P1769 
2 
www.priestsforlife.org 
Priests for Life Index 
P1948 
2 
www.politics1.com/issues.htm 
Politics1: Hot Political Debates & Issues 
P2515 
1 
www.ohiolife.org 
Ohio Right To Life 

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
Kleinberg 
10 
8 
9 
9 
1 
4 
1 
8 
10 
pSALSA 
8 
10 
7 
7 
3 
6 
3 
8 
8 
HubAvg 
9 
7 
10 
10 
0 
3 
0 
7 
9 
AThresh 
9 
7 
10 
10 
0 
3 
0 
7 
9 
HThresh 
1 
3 
0 
0 
10 
6 
7 
3 
1 
FThresh 
4 
6 
3 
3 
6 
10 
4 
6 
4 
BFS 
1 
3 
0 
0 
7 
4 
10 
3 
1 
SBayesian 
8 
8 
7 
7 
3 
6 
3 
10 
8 
Bayesian 
10 
8 
9 
9 
1 
4 
1 
8 
10 
Query: +net +censorship (Base Set size = 2947)

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
1. 
P375 
P371 
P371 
P371 
P375 
P375 
P371 
P371 
P375 
2. 
P3163 
P1440 
P2874 
P375 
P1344 
P371 
P375 
P1440 
P371 
3. 
P3180 
P375 
P2871 
P2871 
P3132 
P1344 
P1299 
P375 
P3180 
4. 
P3177 
P2874 
P2873 
P2874 
P3163 
P3130 
P1440 
P2874 
P3177 
5. 
P3173 
P2871 
P2659 
P3536 
P3166 
P3131 
P2871 
P1299 
P3163 
6. 
P3172 
P1299 
P375 
P2873 
P3167 
P3132 
P2874 
P2871 
P3173 
7. 
P3132 
P3536 
P3536 
P2659 
P3168 
P3133 
P3536 
P3536 
P3166 
8. 
P3193 
P1712 
P2639 
P2639 
P3170 
P3135 
P1802 
P1712 
P3193 
9. 
P3170 
P268 
P1440 
P1440 
P3171 
P3161 
P2639 
P268 
P3168 
10. 
P3166 
P1445 
P2867 
P3180 
P3172 
P3162 
P452 
P1445 
P3132 
Index 
pop 
URL 
Title 
P268 
2 
www.epic.org 
Electronic Privacy Information Center 
P371 
7 
www.yahoo.com 
Yahoo! 
P375 
9 
www.cnn.com 
CNN.com 
P452 
1 
www.mediachannel.org 
MediaChannel.org  A Global Network ........ 
P1299 
3 
www.eff.org/blueribbon.html 
EFF Blue Ribbon Campaign 
P1344 
2 
www.igc.apc.org/peacenet 
PeaceNet Home 
P1440 
5 
www.eff.org 
EFF ...  the Electronic Frontier Foundation 
P1445 
2 
www.cdt.org 
The Center for Democracy and Technology 
P1712 
2 
www.aclu.org 
ACLU: American Civil Liberties Union 
P1802 
1 
ukonlineshop.about.com 
Online Shopping: UK 
P2639 
3 
www.imdb.com 
The Internet Movie Database (IMDb). 
P2659 
2 
www.altavista.com 
AltaVista  Welcome 
P2867 
1 
home.netscape.com 
Empty title field 
P2871 
5 
www.excite.com 
My Excite Start Page 
P2873 
2 
www.mckinley.com 
Welcome to Magellan! 
P2874 
5 
www.lycos.com 
Lycos 
P3130 
1 
www.city.net/countries/kyrgyzstan 
Excite Travel 
P3131 
1 
www.bishkek.su/krg/Contry.html 
ElCat. 404: Not Found. 
P3132 
4 
www.pitt.edu/~cjp/rees.html 
REESWeb: Programs: 
P3133 
1 
www.ripn.net 
RIPN 
P3135 
1 
www.yahoo.com/.../Kyrgyzstan 
Yahoo! Regional Countries Kyrgyzstan 
P3161 
1 
151.121.3.140/fas/faspublications/... 
Error 404 Redirector 
P3162 
1 
www.rferl.org/BD/KY 
RFE/RL Kyrgyz Service : News 
P3163 
3 
www.usa.ft.com 
Empty title field 
P3166 
3 
www.pathfinder.com/time/daily 
TIME.COM 
P3167 
1 
travel.state.gov 
US State Department  Services  Consular Affairs 
P3168 
2 
www.yahoo.com/News 
Yahoo! News and Media 
P3170 
2 
www.financenet.gov 
...FinanceNet is the government's official home... 
P3171 
1 
www.securities.com 
ISI Emerging Markets 
P3172 
2 
www.oecd.org 
OECD Online 
P3173 
2 
www.worldbank.org 
The World Bank Group 
P3177 
2 
www.envirolink.org 
EnviroLink Network 
P3180 
3 
www.lib.utexas.edu/.../Map_collection 
PCL Map Collection 
P3193 
2 
www.wiesenthal.com 
Simon Wiesenthal Center 
P3536 
5 
www.shareware.com 
CNET.com  Shareware.com 

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
Kleinberg 
10 
1 
1 
2 
6 
2 
1 
1 
8 
pSALSA 
1 
10 
6 
6 
1 
2 
7 
10 
2 
HubAvg 
1 
6 
10 
9 
1 
2 
7 
6 
2 
AThresh 
2 
6 
9 
10 
1 
2 
7 
6 
3 
HThresh 
6 
1 
1 
1 
10 
3 
1 
1 
5 
FThresh 
2 
2 
2 
2 
3 
10 
2 
2 
3 
BFS 
1 
7 
7 
7 
1 
2 
10 
7 
2 
SBayesian 
1 
10 
6 
6 
1 
2 
7 
10 
2 
Bayesian 
8 
2 
2 
3 
5 
3 
2 
2 
10 
Query: Movies (Base Set size = 5757)

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
1. 
P678 
P999 
P999 
P999 
P678 
P999 
P999 
P999 
P999 
2. 
P2268 
P2832 
P2832 
P2832 
P2266 
P2832 
P2832 
P2832 
P2832 
3. 
P2304 
P6359 
P803 
P6359 
P2268 
P1989 
P2827 
P6359 
P2827 
4. 
P2305 
P2827 
P1539 
P2827 
P999 
P1911 
P2803 
P2827 
P6359 
5. 
P2306 
P2120 
P2101 
P2838 
P2832 
P1980 
P5470 
P1374 
P678 
6. 
P2308 
P1374 
P1178 
P6446 
P2263 
P1983 
P2120 
P2120 
P2838 
7. 
P2310 
P803 
P6359 
P5 
P2264 
P1984 
P4577 
P803 
P2266 
8. 
P2266 
P1539 
P1082 
P2803 
P2265 
P1986 
P5 
P6446 
P2268 
9. 
P2325 
P6446 
P2827 
P2839 
P2280 
P1987 
P2838 
P1539 
P2308 
10. 
P2299 
P2838 
P6446 
P2840 
P2304 
P1993 
P4534 
P2838 
P2330 
Index 
pop 
Title 
URL 
P5 
2 
www.movies.com 
Movies.com 
P678 
3 
chatting.about.com 
Empty title field 
P803 
3 
www.google.com 

P999 
8 
www.moviedatabase.com 
The Internet Movie Database (IMDb). 
P1082 
1 
www.amazon.com/ 
Amazon.comEarth's Biggest Selection 
P1178 
1 
www.booksfordummies.com 
Empty title field 
P1374 
2 
www.onwisconsin.com 
On Wisconsin 
P1539 
3 
206.132.25.51 
Washingtonpost.com  News Front 
P1911 
1 
people2people.com/...nytoday 
People2People.com  Search 
P1980 
1 
newyork.urbanbaby.com/nytoday 
Kids & Family 
P1983 
1 
tunerc1.va.everstream.com/nytoday/ 
Empty title field 
P1984 
1 
nytoday.opentable.com/ 
OpenTable 
P1986 
1 
www.nytimes.com/.../jobmarket 
The New York Times: Job Market 
P1987 
1 
www.cars.com/nytimes 
New York Today cars.com  new and used car ... 
P1989 
1 
www.nytodayshopping.com 
New York Today Shopping  Shop for computers, ... 
P1993 
1 
www.nytimes.com/.../nytodaymediakit 
New York Today  Online Media Kit 
P2101 
1 
www2.ebay.com/aw/announce.shtml 
eBay Announcement Board 
P2120 
3 
www.mylifesaver.com 
welcome to mylifesaver.com 
P2263 
1 
clicks.about.com/...nationalinterbank 
Banking Center 
P2264 
1 
clicks.about.com/ 
Credit Report, Free Trial Offer 
P2265 
1 
membership.about.com/... 
Member Center 
P2266 
3 
home.about.com/movies 
About  Movies 
P2268 
3 
azlist.about.com 
About.com AZ 
P2280 
1 
sprinks.about.com 
Sprinks : About Sprinks 
P2299 
1 
home.about.com/aboutaus 
About Australia 
P2304 
2 
home.about.com/arts 
About  Arts/Humanities 
P2305 
1 
home.about.com/autos 
About  Autos 
P2306 
1 
home.about.com/citiestowns 
About  Cities/Towns 
P2308 
2 
home.about.com/compute 
About  Computing/Technology 
P2310 
1 
home.about.com/education 
About  Education 
P2325 
1 
home.about.com/musicperform 
About  Music/Performance 
P2330 
1 
home.about.com/recreation 
About  Recreation/Outdoors 
P2803 
2 
www.allmovie.com 
All Movie Guide 
P2827 
6 
www.film.com 
Film.com Movie Reviews, News, Trailers... 
P2832 
8 
www.hollywood.com 
Hollywood.com  Your entertainment source... 
P2838 
5 
www.mca.com 
Universal Studios 
P2839 
1 
www.mgmua.com 
MGM  Home Page 
P2840 
1 
www.miramax.com 
Welcome to the Miramax Cafe 
P4534 
1 
www.aintitcoolnews.com 
Ain't It Cool News 
P4577 
1 
go.com 
GO.com 
P5470 
1 
www.doubleaction.net 
Double Action  Stand. Point. Laugh. 
P6359 
5 
www.paramount.com 
Paramount Pictures  Home Page 
P6446 
4 
www.disney.com 
Disney.com  Where the Magic Lives Online! 

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
Kleinberg 
10 
0 
0 
0 
4 
0 
0 
0 
4 
pSALSA 
0 
10 
7 
6 
2 
2 
5 
10 
5 
HubAvg 
0 
7 
10 
5 
2 
2 
3 
7 
4 
AThresh 
0 
6 
5 
10 
2 
2 
6 
6 
5 
HThresh 
4 
2 
2 
2 
10 
2 
2 
2 
5 
FThresh 
0 
2 
2 
2 
2 
10 
2 
2 
2 
BFS 
0 
5 
3 
6 
2 
2 
10 
5 
4 
SBayesian 
0 
10 
7 
6 
2 
2 
5 
10 
5 
Bayesian 
4 
5 
4 
5 
5 
2 
4 
5 
10 
Query: genetic (Base Set size = 3468)

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
1. 
P2187 
P2187 
P2187 
P2187 
P2187 
P2187 
P2187 
P2187 
P2187 
2. 
P1057 
P258 
P1057 
P1057 
P1057 
P1057 
P3932 
P258 
P1057 
3. 
P2168 
P1057 
P3932 
P2168 
P2168 
P2168 
P1538 
P1057 
P2168 
4. 
P2200 
P3932 
P2095 
P2200 
P2200 
P2200 
P1057 
P3932 
P2095 
5. 
P2219 
P2095 
P2168 
P2219 
P2219 
P2219 
P2095 
P2095 
P2200 
6. 
P2199 
P1538 
P2186 
P2095 
P2199 
P2095 
P258 
P1538 
P2219 
7. 
P2095 
P2 
P941 
P3932 
P2186 
P3932 
P2168 
P2 
P3932 
8. 
P2186 
P2168 
P0 
P2199 
P2095 
P2199 
P2200 
P2168 
P2199 
9. 
P2193 
P941 
P2200 
P2186 
P2193 
P2186 
P2 
P941 
P2186 
10. 
P3932 
P23 
P2199 
P2193 
P3932 
P2193 
P2199 
P2200 
P2193 
Index 
pop 
URL 
Title 
P0 
1 
www.geneticalliance.org 
Genetic Alliance, Washington, DC 
P2 
3 
www.geneticprogramming.org 
geneticprogramming.orgHomePage 
P23 
1 
www.geneticprogramming.com 
The Genetic Programming Notebook 
P258 
3 
www.aic.nrl.navy.mil/galist 
The Genetic Algorithms Archive 
P941 
3 
www3.ncbi.nlm.nih.gov/Omim 
OMIM Home Page  Online Mendelian Inheritance in Man 
P1057 
9 
gdbwww.gdb.org 
The Genome Database 
P1538 
3 
www.yahoo.com 
Yahoo! 
P2095 
9 
www.nhgri.nih.gov 
National Human Genome Research Institute (NHGRI) 
P2168 
9 
wwwgenome.wi.mit.edu 
Welcome To the ..... Center for Genome Research 
P2186 
6 
www.ebi.ac.uk 
EBI, the European Bioinformatics Institute ........ 
P2187 
9 
www.ncbi.nlm.nih.gov 
NCBI HomePage 
P2193 
5 
www.genome.ad.jp 
GenomeNet WWW server 
P2199 
7 
www.hgmp.mrc.ac.uk 
UK MRC HGMPRC 
P2200 
8 
www.tigr.org 
The Institute for Genomic Research 
P2219 
5 
www.sanger.ac.uk 
The Sanger Centre Web Server 
P3932 
9 
www.nih.gov 
National Institutes of Health (NIH) 

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
Kleinberg 
10 
5 
8 
10 
10 
10 
7 
6 
10 
pSALSA 
5 
10 
6 
5 
5 
5 
8 
9 
5 
HubAvg 
8 
6 
10 
8 
8 
8 
7 
7 
8 
AThresh 
10 
5 
8 
10 
10 
10 
7 
6 
10 
HThresh 
10 
5 
8 
10 
10 
10 
7 
6 
10 
FThresh 
10 
5 
8 
10 
10 
10 
7 
6 
10 
BFS 
7 
8 
7 
7 
7 
7 
10 
9 
7 
SBayesian 
6 
9 
7 
6 
6 
6 
9 
10 
6 
Bayesian 
10 
5 
8 
10 
10 
10 
7 
6 
10 
Query: +computational +geometry (Base Set size = 1226)

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
1. 
P161 
P161 
P634 
P161 
P0 
P161 
P0 
P161 
P161 
2. 
P0 
P1 
P632 
P0 
P161 
P0 
P161 
P1 
P0 
3. 
P1 
P0 
P633 
P1 
P1 
P1 
P1 
P0 
P1 
4. 
P162 
P3 
P161 
P162 
P162 
P162 
P300 
P3 
P162 
5. 
P3 
P280 
P1 
P3 
P280 
P280 
P299 
P280 
P3 
6. 
P280 
P634 
P1406 
P280 
P3 
P3 
P162 
P634 
P280 
7. 
P275 
P162 
P0 
P275 
P275 
P275 
P3 
P162 
P275 
8. 
P299 
P2 
P3 
P299 
P299 
P299 
P280 
P2 
P299 
9. 
P300 
P632 
P162 
P300 
P300 
P848 
P375 
P633 
P300 
10. 
P848 
P633 
P280 
P848 
P848 
P300 
P551 
P632 
P848 
Index 
pop 
URL 
Title 
P0 
9 
www.geom.umn.edu/software/cglist 
Directory of Computational Geometry Software 
P1 
9 
www.cs.uu.nl/CGAL 
The former CGAL home page 
P2 
2 
link.springer.de/link/service/journals/00454 
LINK: Peaktime overload 
P3 
9 
www.scs.carleton.ca/~csgs/resources/cg.html 
Computational Geometry Resources 
P161 
9 
www.ics.uci.edu/~eppstein/geom.html 
Geometry in Action 
P162 
9 
www.ics.uci.edu/~eppstein/junkyard 
The Geometry Junkyard 
P275 
5 
www.ics.uci.edu/~eppstein 
David Eppstein 
P280 
9 
www.geom.umn.edu 
The Geometry Center Welcome Page 
P299 
6 
www.mpisb.mpg.de/LEDA/leda.html 
LEDA  Main Page of LEDA Research 
P300 
6 
www.cs.sunysb.edu/~algorith 
The Stony Brook Algorithm Repository 
P375 
1 
graphics.lcs.mit.edu/~seth 
Seth Teller 
P551 
1 
www.cs.sunysb.edu/~skiena 
Steven Skiena 
P632 
3 
www.w3.org/Style/CSS/Buttons 
CSS button 
P633 
3 
jigsaw.w3.org/cssvalidator 
W3C CSS Validation Service 
P634 
3 
validator.w3.org 
W3C HTML Validation Service 
P848 
5 
www.inria.fr/prisme/....../cgt 
CG Tribune 
P1406 
1 
www.informatik.rwthaachen.de/..... 
Department of Computer Science, Aachen 

Kleinberg 
pSALSA 
HubAvg 
AThresh 
HThresh 
FThresh 
BFS 
SBayesian 
Bayesian 
Kleinberg 
10 
6 
6 
10 
10 
10 
8 
6 
10 
pSALSA 
6 
10 
9 
6 
6 
6 
6 
10 
6 
HubAvg 
6 
9 
10 
6 
6 
6 
6 
9 
6 
AThresh 
10 
6 
6 
10 
10 
10 
8 
6 
10 
HThresh 
10 
6 
6 
10 
10 
10 
8 
6 
10 
FThresh 
10 
6 
6 
10 
10 
10 
8 
6 
10 
BFS 
8 
6 
6 
8 
8 
8 
10 
6 
8 
SBayesian 
6 
10 
9 
6 
6 
6 
6 
10 
6 
Bayesian 
10 
6 
6 
10 
10 
10 
8 
6 
10 
Footnotes
 ... SALSA)
 We thank Ronny Lempel and Shlomo Moran for pointing out the difference between SALSA and pSALSA.
 ... domain
 If one modifies the way the Base Set or the graph is constructed, the results of the algorithms can vary dramatically. In the abovementioned site we report the output of the algorithms for the same query, over different graphs.
 ... pSALSA
 In a version of the "abortion" query (denoted "refined" in the site), the HubAveraging algorithm did some mixing of communities, but to a smaller degree than pSALSA.
 ... follows
 The definition of similarity can be generalized to any distance function , and any normalization norm , by requiring instead that as , where is the maximum distance between any two vectors with unit norm. Most of our results can be generalized to any distance function and any norm.
 ...algorithms
 As in the case of similarity, the notion of stability can be defined for any distance function, and for any normalization norm.
Panayiotis Tsaparas 20010417