## The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect

###### Abridged Version

###### R. Lempel S. Moran

Department of Computer Science

The Technion,

Haifa 32000, Israel

email: {rlempel,moran}@cs.technion.ac.il

### ABSTRACT

Today, when searching for information on the WWW, one usually performs a query through a term-based search engine. These engines return, as the query's result, a list of Web sites whose contents matches the query. For broad topic queries, such searches often result in a huge set of retrieved documents, many of which are irrelevant to the user. However, much information is contained in the link-structure of the WWW. Information such as which pages are linked to others can be used to augment search algorithms. In this context, Jon Kleinberg introduced the notion of two distinct types of Web-sites: *hubs* and *authorities*. Kleinberg argued that hubs and authorities exhibit a *mutually reinforcing relationship*: A good hub will point to many authorities, and a good authority will be pointed at by many hubs. In light of this, he devised an algorithm aimed at finding authoritative sites.

We present SALSA - a new stochastic approach for link structure analysis, which examines random walks on graphs derived from the link structure. We show that both SALSA and Kleinberg's Mutual Reinforcement approach employ the same meta-algorithm. We then prove that SALSA is equivalent to a weighted in-degree analysis of the link-structure of WWW subgraphs, making it computationally more efficient than the Mutual Reinforcement approach.

We compare the results of applying SALSA to the results derived through Kleinberg's approach.

These comparisons reveal a topological phenomenon called the *TKC Effect* which, in certain cases,
prevents the Mutual Reinforcement approach from identifying meaningful authorities.

#### Keywords: Information Retrieval; Link Structure Analysis; Hubs and Authorities; Random walks; SALSA.

### 1. Introduction

##### Searching the WWW - The Challenge

The WWW is a rapidly expanding hyperlinked collection of unstructured information. The lack of structure and the enormous volume of the WWW pose tremendous challenges on the WWW Information Retrieval systems called search engines. These search engines are presented with queries, and return a list of Web-sites which are deemed (by the engine) to pertain to the query.

When considering the difficulties which WWW search engines face, we distinguish between narrow-topic queries and broad-topic queries. This distinction pertains to the presence which the query's topic has on the Web:

Narrow topic queries are queries for which very few resources exist on the Web, and which present a "needle in the haystack" challenge for search engines. An example for such a query is an attempt to locate the lyrics of a specific song, by quoting a line from it ("We all live in a yellow submarine"). Search engines encounter a *recall* challenge when handling such queries - Finding the few resources which pertain to the query.

On the other hand, broad-topic queries pertain to topics for which there is an abundance of information on the Web, sometimes as many as millions of relevant resources (with varying degrees of relevance). The vast majority of users are not interested in retrieving the entire huge set of resources - most users will be quite satisfied with a few *authoritative* results: Web sites which are highly relevant to the topic of the query, significantly more than most other sites. The challenge which search engines face here is one of *precision* - Retrieving only the most relevant resources to the query.

This work focuses on finding authoritative resources which pertain to broad-topic queries.

##### Term-based search engines

Term-based search engines face both classical problems in Information Retrieval, as well as problems specific to the WWW setting, when handling broad-topic queries. The classic problems include the following issues ([20],[4]):

- Synonymy - Retrieving documents containing the term "car" when given the query "automobile".
- Polysemy/Ambiguity - When given the query "Jordan", should the engine retrieve pages pertaining to the Hashemite Kingdom of Jordan, or pages pertaining to basketball legend Michael Jordan?
- Authorship styles - This is a generalization of the synonymy issue. Two documents, which pertain to the same topic, can sometimes use very different vocabularies and figures of speech when written by different authors (as an example, the styles of two documents, one written in British English and the other in American English, might differ considerably).

In addition to the classical issues in Information Retrieval, there is a Web-specific obstacle which search engines must overcome, called *search engine persuasion* ([19]). There may be millions of sites pertaining in some manner to broad-topic queries, but most users will only browse through the first ten results returned by their favorite search facility. With the growing economic impact of the WWW, and the growth of e-commerce, it is crucial for businesses to have their sites ranked high by the major search engines. There are quite a few companies who sell this kind of expertise - They design Web sites which are tailored to rank high with specific queries on the major search engines. These companies research the ranking algorithms and heuristics of term-based engines, and know how many keywords to place (and where) in a Web-page so as to improve the page's ranking (which directly impacts the page's visibility). A less sophisticated technique, used by some site creators, is called *keyword spamming* ([4]). Here, the authors repeat certain terms (some of which are only remotely connected to their site's context), in order to "lure" search engines into ranking them highly for many queries.

##### Informative link structure - The answer?

The WWW is a hyperlinked collection. In addition to the textual content of the individual pages, the link structure of such collections contains information which can, and should, be tapped when searching for authoritative sources. Consider the significance of a link *p**q*: With such a link *p* suggests, or even recommends, that surfers visiting *p* follow the link and visit *q*. This may reflect the fact that pages *p* and *q* share a common topic of interest, and that the author of *p* thinks highly of *q*'s contents. Such a link, called an *informative link*, is *p*'s way to confer authority on *q* ([16]). Note that informative links provide a positive critical assessment of *q*'s contents which originates from outside the control of the author of *q* (as opposed to assessments based on *q*'s textual content, which is under complete control of *q*'s author). This makes the information extracted from informative links less vulnerable to manipulative techniques such as spamming.

Unfortunately, not all links are informative. There are many kinds of links which confer little or no authority ([4]), such as intra-domain (inner) links (whose purpose is to provide navigational aid in a complex Web site of some organization), commercial/sponsor links, and links which result from link-exchange agreements. A crucial task which should be completed prior to analyzing the link structure of a given collection, is to filter out as many of the non-informative links as possible.

##### Related work on link structures

Prior to the WWW age, link structures were studied in the area of bibliometrics, which studies the citation structure of written documents ([23],[15]). Many works in this area were aimed at finding high-impact papers published in scientific journals ([10]), and at clustering related documents ([1])

Some works have studied the Web's link structure, in addition to the textual content of the pages, as means to visualize areas thought to contain good resources ([3]). Other works used link structures for categorizing pages and clustering them ([24],[21]).

Marchiori, in [19], uses the link-structure of the Web to enhance search results of term-based search engines. This is done by considering the potential hyper-information contained in each Web-page: The information that can be found when following hyperlinks which originate in the page.

This work is motivated by the approach introduced by Jon Kleinberg ([16]). In an attempt to impose some structure on the chaotic WWW, Kleinberg distinguished between two types of Web-sites which pertain to a certain topic: The first are *authoritative* pages in the sense described previously. The second type of sites are *hub* pages. Hubs are resource lists - They do not directly contain information pertaining to the topic, but rather point to many authoritative sites. According to this model, hubs and authorities exhibit a *mutually reinforcing relationship*: Good hubs point to many good authorities, and good authorities are pointed at by many good hubs. In light of the mutually reinforcing relationship, hubs and authorities should form communities, which can be pictured as dense bipartite portions of the Web, where the hubs link densely to the authorities. The most prominent community in a WWW subgraph is called the *principal community* of the collection.

Kleinberg suggested an algorithm to identify these communities, which is described in detail in section 2.

Researchers from IBM's Almaden Research Center have implemented Kleinberg's algorithm in various projects. The first was *HITS*, which is described in [11], and offers some enlightening practical remarks. The *ARC* system, described in [7], augments Kleinberg's link-structure analysis by considering also the anchor text, the text which surrounds the hyperlink in the pointing page. The reasoning behind this is that many times, the pointing page describes the destination page's contents around the hyperlink, and thus the authority conferred by the links can be better assessed. These projects were extended by the *CLEVER* project ([14]). Researchers from outside IBM, such as Henzinger and Brahat, have also studied Kleinberg's approach and have proposed improvements to it ([13]).

Anchor text has also been used by Brin and Page in [2]. Another major feature of their work on the *Google* search engine ([12]) is a link-structure based site ranking approach called *PageRank*, which can be interpreted as a stochastic analysis of some random-walk behavior through the entire WWW.

In [18], the authors use the links surrounding a small set of same-topic sites to assemble a larger collection of neighboring pages which should contain many authoritative resources on the initial topic. The textual content of the collection is then analyzed in ranking the relevancy of its individual pages.

##### This work

While preserving the theme that Web sites pertaining to a given topic should be split to hubs and authorities, we replace Kleinberg's Mutual Reinforcement approach of [16] by a new stochastic approach (SALSA), in which the coupling between hubs and authorities is less tight. The intuition behind our approach is the following: consider a bipartite graph *G*, whose two parts correspond to hubs and authorities, where an edge between hub *r* and authority *s* means that there is an informative link from *r* to *s*. Then, authorities and hubs pertaining to the dominant topic of the sites in *G* should be highly visible (reachable) from many sites in *G*. Thus, we will attempt to identify these sites by examining certain random walks in *G*, under the proviso that such random walks will tend to visit these highly visible sites more frequently than other, less connected sites. We show that in finding the principal communities of hubs and authorities, both Kleinberg's Mutual Reinforcement approach and our Stochastic approach employ the same meta-algorithm on different representations of the input graph. We then compare the results of applying SALSA to the results derived by Kleinberg's approach. Through these comparisons, we isolate a particular topological phenomenon which we call the *Tightly Knit Community (TKC) Effect*. In certain scenarios, this effect hampers the ability of the the Mutual Reinforcement approach to identify meaningful authorities. We demonstrate that SALSA is less vulnerable to the TKC effect, and can find meaningful authorities in collections where the Mutual Reinforcement approach fails to do so.

After demonstrating some results achieved by means of SALSA, we prove that the ranking of sites in the Stochastic approach may be calculated by examining the weighted in/out degrees of the sites in *G*. This result yields that SALSA is computationally lighter than the Mutual Reinforcement approach. We also discuss the reason for our success with analyzing weighted in/out degrees of sites, which previous work has claimed to be unsatisfactory for identifying authoritative sites.

The rest of the paper is organized as follows: Section 2 recounts Kleinberg's Mutual Reinforcement Approach. In section 3 we view Kleinberg's approach from a higher level, and define a meta-algorithm for link structure analysis. Section 4 presents our new approach, SALSA. In section 5 we compare the two approaches by considering their outputs on the WWW and on artificial topologies. Then, in section 6 we prove the connection between SALSA and weighted in/out degree rankings of sites. Our conclusions and ideas for future work are brought in section 7. The paper uses basic results from the theory of stochastic processes, which are brought in the full version. The main contribution of the paper can be grasped without following the full mathematical analysis.

### 2. Kleinberg's Mutual Reinforcement Approach

The Mutual Reinforcement approach ([16]) starts by assembling a collection ** C** of Web-sites, which should contain communities of hubs and authorities pertaining to a given topic

*t*. It then analyzes the link structure induced by that collection, in order to find the authoritative sites on topic

*t*.

Denote by *q* a term-based search query to which sites in our topic of interest *t* are deemed to be relevant. The collection** C** is assembled in the following manner:

- A
*root set**S*of sites is obtained by applying a term based search engine, such as AltaVista [8], to the query*q*. This is the only step in which the lexical content of the Web sites is examined. - From
*S*we derive a*base set*which consists of (a) sites in the root set*C**S*, (b) sites which point to a site in*S*, and (c) sites which are pointed to by a site in*S*. In order to obtain (b), we must again use a search engine. Many search engines store linkage information, and support queries such as "which sites point to [a given url]".

The collection ** C** and its link structure induce the following directed graph

*G*:

*G*'s nodes are the sites in

**, and for all**

*C**i , j*

**, the directed edge**

*C**i*

*j*appears in

*G*if and only if site

*i*contains a hyperlink to site

*j*. Let

*W*denote the |

**| × |**

*C**| adjacency matrix of*

**C***G*.

Each site *s*** C** is now assigned a pair of weights, a hub-weight

*h*(

*s*) and an authority weight

*a*(

*s*), based on the following two principles:

- The quality of a hub is determined by the quality of the authorities it points at. Specifically, a site's hub weight should be proportional to the sum of the authority weights of the sites it points at.
- "Authority lies in the eyes of the beholder(s)": A site is authoritative only if good hubs deem it as such. Hence, a site's authority weight is proportional to the sum of the hub-weights of the sites pointing at it.

The top ranking sites, according to both kinds of weights, form the Mutually Reinforcing communities of hubs and authorities. In order to assign such weights, Kleinberg uses the following iterative algorithm:

- Initialize
*a(s)*1 ,*h(s)*1 for all sites*s*.*C* - Repeat the following three operations until convergence:
- Update the authority weight of each site
*s*(theoperation):*I*

- Update the hub weight of each site
*s*(theoperation):*O*

- Normalize the authority weights and the hub weights.

- Update the authority weight of each site

Note that applying the ** I** operation is equivalent to assigning authority weights according to the result of multiplying the vector of all hub weights by the matrix

*W*

^{T}. The

**operation is equivalent to assigning hub weights according to the result of multiplying the vector of all authority weights by the matrix**

*O**W*.

Kleinberg showed that this algorithm converges, and that the resulting authority weights [hub weights] are the coordinates of the normalized principal eigenvector of *W*^{T}*W* [of *WW*^{T}] (the eigenvector which corresponds to the eigenvalue of highest magnitude of the matrix). *W*^{T}*W* and *WW*^{T} are well known matrices in the field of bibliometrics:

*A=W*^{T}*W*is the*co-citation matrix*([23]) of the collection. [*A*]_{i,j}is the number of sites which jointly point at (cite) pages*i*and*j*. Kleinberg's iterative algorithm converges to authority weights which correspond to the entries of the (unique, normalized) principal eigenvector of*A*.*H=WW*^{T}is the*bibliographic coupling matrix*([15]) of the collection. [*H*]_{i,j}is the number of sites jointly referred to (pointed at) by pages*i*and*j*. Kleinberg's iterative algorithm converges to hub weights which correspond to the entries of*H*'s (unique, normalized) principal eigenvector.

### 3. A Meta-Algorithm for Link Structure Analysis

Examining the Mutual Reinforcement approach from a higher level, we can identify a general framework, or meta-algorithm, for finding hubs and authorities by link-structure analysis. This meta-algorithm is a version of the spectral filtering method, presented in [6]:

- Given a topic
*t*, construct a site collectionwhich should contain many*C**t*-hubs and*t*-authorities, but should not contain many hubs or authorities for any other topic*t*'. Let*n =*||.**C** - Derive, from
and the link structure induced by it, two*C**n*×*n*association matrices - A*hub matrix**H*and an*authority matrix**A*. Association matrices are widely used in classification algorithms ([22]), and will be used here in order to classify the Web sites into communities of hubs/authorities. The association matrices which are used by the meta-algorithm will have the following algebraic property (let*M*denote such a matrix):*M*will have a unique real positive eigenvalue*µ(M)*of multiplicity 1, such that for any other eigenvalue*µ'*of*M*,*µ(M)*> |*µ'*|. Denote by*v*the (unique) unit eigenvector which corresponds to_{µ(M)}*µ(M)*whose first non-zero coordinate is positive.*v*will actually be a positive vector, and will be referred to as the_{µ(M)}*principal eigenvector*of*M*. - The sites that correspond to the largest coordinates of
*v*will form the_{µ(A)}*principal algebraic community of authorities*in, and the sites that correspond to the largest coordinates of*C**v*will form the_{µ(H)}*principal algebraic community of hubs*in.*C*

For the meta-algorithm to be useful, the algebraic principal communities of hubs and authorities should reflect the true authorities and hubs in *C*

The two degrees of freedom which the meta-algorithm allows, are the method for obtaining the collection, and the definition of the association matrices. Given a specific collection, the algebraic communities produced by the meta-algorithm are determined solely by the definition of the association matrices.

### 4. SALSA: Analyzing a Random Walk on the Web

In this section we introduce the *Stochastic Approach for Link Structure Analysis - SALSA*. The approach is based upon the theory of Markov chains, and relies on the stochastic properties of random walks performed on our collection of sites. It follows the meta-algorithm described in section 3, and differs from the Mutual Reinforcement approach in the manner in which the association matrices are defined.

The input to our scheme consists of a collection of sites ** C** which is built around a topic

*t*in the manner described in section 2. Intuition suggests that authoritative sites on topic

*t*should be visible from many sites in the subgraph induced by

**. Thus, a random walk on this subgraph will visit**

*C**t*-authorities with high probability.

We combine the theory of random walks with the notion of the two distinct types of Web sites, hubs and authorities, and actually analyze two different Markov chains: A chain of hubs and a chain of authorities. Unlike "conventional" random walks on graphs, state transitions in these chains are generated by traversing __ two__ WWW-links in a row, one link forward and one link backwards (or vise versa). Analyzing both chains allows our approach to give each Web site two distinct scores, a hub score and an authority score.

The idea of ranking Web sites using random walks is not new. The search engine *Google* ([2], [12]) incorporates stochastic information into its ranking of pages. The *PageRank* component of the search engine examines a __single__ random walk on the __entire WWW__. Hence, the ranking of Web sites in *Google* is independent of the search query (a global ranking), and no distinction is made between hubs and authorities.

Let us build a bipartite undirected graph *G'=(V _{h} ,V_{a} ,E)* from our site collection

**and its link structure:**

*C**V*= {_{h}*s*|_{h}*s*and*C**out*-*degree*(*s*) > 0 } (the*hub-side*of*G'*).*V*= {_{a}*s*|_{a}*s*and*C**in*-*degree*(*s*) > 0 } (the authority-side of*G'*).*E*= { (*s*|_{h}, r_{a})*s**r in*}**C**

*s*

**is represented by two nodes of**

*C**G'*,

*s*

_{h}and

*s*

_{a}. Each WWW-link

*s*

*r*is represented by an undirected edge connecting

*s*

_{h}and

*r*

_{a}.

On this bipartite graph we will perform two distinct random walks. Each walk will only visit nodes from one of the two sides of the graph, by traversing paths consisting of two *G'*-edges in each step. Since each edge crosses sides of *G'*, each walk is confined to just one of the graph's sides, and the two walks will naturally start off from different sides of *G'*. Note also that every path of length 2 in *G'* represents a traversal of one WWW link in the proper direction (when passing from the hub-side of *G'* to the authority-side), and a retreat along a WWW link (when crossing in the other direction). Since the hubs and authorities of topic *t* should be highly visible in *G'* (reachable from many nodes by either a direct edge or by short paths), we may expect that the *t*-authorities will be amongst the nodes most frequently visited by the random walk on *V*_{a}, and that the *t*-hubs will be amongst the nodes most frequently visited by the random walk on *V*_{h}.

We will examine the two different Markov chains which correspond to these random walks: The chain of the visits to the authority side of *G'* (the *authority chain*), and the chain of visits to the hub side of *G'*. Analyzing these chains separately naturally distinguishes between the two aspects of each site.

We now define two stochastic matrices, which are the transition matrices of the two Markov chains at interest:

*The hub-matrix*:

*The authority-matrix*:**Ã**

A positive transition probability ** ã_{i,j} > 0** implies that a certain page

*h*points to both pages

*i*and

*j*, and hence page

*j*is reachable from page

*i*by two steps: retracting along the link

*h*

*i*and then following the link

*h*

*j*.

Alternatively, the matrices and * Ã* can be defined as follows: Let

*W*be the adjacency matrix of the directed graph defined by

**and its link structure. Denote by**

*C**W*

_{r}the matrix which results by dividing each nonzero entry of

*W*by the sum of the entries in its row, and by

*W*

_{c}the matrix which results by dividing each nonzero element of

*W*by the sum of the entries in its column (Obviously, the sums of rows/columns which contain nonzero elements are greater than zero). Then consists of the non-zero rows and columns of

*W*

_{r}

*W*

_{c}

^{T}, and

*consists of the non-zero rows and columns of*

**Ã***W*

_{c}

^{T}

*W*

_{r}. We ignore the rows and columns of

*, which consist entirely of zeros, since (by definition) all the nodes of*

**Ã***G'*have at least one incident edge. The matrices

*and serve as the association matrices required by the meta-algorithm for identifying the authorities and hubs. Recall that the Mutual Reinforcement approach uses the association matrices*

**Ã***A=W*and

^{T}W*H=WW*.

^{T}We shall assume that *G'* is connected, causing both stochastic matrices * Ã* and to be irreducible. This assumption does not form a limiting factor, since when

*G'*is not connected, we may use our technique on each connected component separately. Section 6.1 further elaborates on the case when

*and have multiple irreducible components.*

**Ã**Some properties of and * Ã* :

- Both matrices are primitive, since the Markov chains which they represent are aperiodic: When visiting any authority(hub), there is a positive probability to revisit it on the next entry to the authority(hub) side of the bipartite graph (since all the nodes are non-isolated). Hence, every state (=site) in each of the chains has a self-loop, causing the chains to be aperiodic.
- The adjacency matrix of the support graph of
is symmetric, since*Ã*implies*ã*_{i,j}> 0. Furthermore,*ã*_{j,i}> 0if and only if*ã*0_{i,j}>*[W*> 0^{T}W]_{i,j}*W W*^{T}).

Following the framework of the meta-algorithm, the principal community of authorities(hubs) found by the SALSA will be composed of the sites whose entries in the principal eigenvector of * Ã* () are the highest. By the Ergodic Theorem ([9]), the principal eigenvector of an irreducible, aperiodic stochastic matrix is actually the stationary distribution of the underlying Markov chain, and its high entries correspond to sites most frequently visited by the (infinite) random walk.

### 5. Results

#### 5.1 The Tightly-Knit Community (TKC) Effect

A tightly-knit community is a small but highly interconnected set of sites. Roughly speaking, the *TKC effect* occurs when such a community scores high in link-analyzing algorithms, even though the sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic. Our study indicates that the Mutual Reinforcement approach is vulnerable to this effect, and will sometimes rank the sites of a TKC in unjustified high positions.

As an example, consider a collection ** C** which contains the following two communities: A community

*y*, with a small number of hubs and authorities, in which every hub points to most of the authorities; and a much larger community

*z*, in which each hub points to a smaller part of the authorities. The topic covered by

*z*is the dominant topic of the collection, and is probably of wider interest on the WWW. Since there are many

*z*-authoritative sites, the hubs do not link to all of them, whereas the smaller

*y*community is densely interconnected. The TKC effect occurs when the sites of

*y*are ranked higher than those of

*z*.

In the full paper we provide a combinatorial construction, which demonstrates such (artificial) communities *y* and *z*, where the Mutual Reinforcement approach scores *y* higher than *z*, and the stochastic approach scores *z* higher. This bias of the Mutual Reinforcement approach towards tightly-knit communities will be demonstrated on WWW queries in the next section.

#### 5.2 The WWW

We tested the different approaches on broad-topic WWW queries (both single topic queries and multi-topic queries). We obtained a collection of sites for each query, and then derived the principal community of authorities with both approaches.Two of these queries ("java", "abortion") were used by Kleinberg in [16], and are brought here for the sake of comparison. All collections were assembled during February, 1999. The root sets were compiled using AltaVista ([8]), which also provided the linkage information needed for building the base sets.

When expanding the root set to the entire collection, we filtered the links pointing to and from Web sites. Following [16], we ignored intra-domain links (since these links tend to be navigational aids inside an intranet, and do not confer authority on the link's destination). We also ignored links to *cgi scripts*, and tried to identify ad-links and ignore them as well. Overall, 38% of the links we examined were ignored. The collections themselves turn out to be relatively sparse graphs, with the number of edges never exceeding three times the number of nodes. We note that a recent work by Kleinberg et al. ([17]) has examined some other connectivity characteristics of such collections.

For each query, we list the top authorities which were returned by the two approaches. The results are displayed in tables containing four columns:

- The url.
- The title of the url.
- The category of the url: (1) for a member of the root set, (2) for a site pointing into the root set, and (3) for a site pointed at by a member of the root set.
- The value of the coordinate of this url in the principal eigenvector of the authority matrix.

##### Single-Topic Query: Java

The results for this query, with our first example of the TKC effect, are shown in table 1. All of the top ten Mutual Reinforcement authorities are part of the EARTHWEB Inc. network. They are interconnected, but since the domain names of the sites are different, the interconnecting links were not filtered out. Some of the sites are highly relevant to the query (and have many incoming links from sites outside the EarthWeb net), but most appear in the principal community only because of their EarthWeb affiliation. With SALSA, only the top three Mutual Reinforcement authorities are retained, and the other seven are replaced by other authorities, some of which are clearly more related to the query.

Size of root size = 160, Size of collection = 2810 Principal Community, Mutual Reinforcement Approach:

url | title | cat | weight |

http://www.jars.com/ | EarthWeb's JARS.COM Java Review Service | (3) | 0.334102 |

http://www.gamelan.com/ | Gamelan - The Official Java Directory | (3) | 0.303624 |

http://www.javascripts.com/ | Javascripts.com - Welcome | (3) | 0.255254 |

http://www.datamation.com/ | EarthWeb's Datamation.com | (3) | 0.251379 |

http://www.roadcoders.com/ | Handheld Software Development@RoadCoders | (3) | 0.250816 |

http://www.earthweb.com/ | EarthWeb | (3) | 0.249373 |

http://www.earthwebdirect.com/ | Welcome to Earthweb Direct | (3) | 0.247467 |

http://www.itknowledge.com/ | ITKnowledge | (3) | 0.246874 |

http://www.intranetjournal.com/ | intranetjournal.com | (3) | 0.24518 |

http://www.javagoodies.com/ | Java Goodies JavaScript Repository | (3) | 0.238793 |

Principal Community, SALSA:

url | title | cat | weight |

http://java.sun.com/ | Java(tm) Technology Home Page | (3) | 0.365264 |

http://www.gamelan.com/ | Gamelan - The Official Java Directory | (3) | 0.36369 |

http://www.jars.com/ | EarthWeb's JARS.COM Java Review Service | (3) | 0.303862 |

http://www.javaworld.com/ | IDG's magazine for the Java community | (3) | 0.217269 |

http://www.yahoo.com/ | Yahoo! | (3) | 0.21412 |

http://www.javasoft.com/ | Java(tm) Technology Home Page | (3) | 0.203099 |

http://www.sun.com/ | Sun Microsystems | (3) | 0.187355 |

http://www.javascripts.com/ | Javascripts.com - Welcome | (3) | 0.138548 |

http://www.htmlgoodies.com/ | htmlgoodies.com - Home | (3) | 0.130676 |

http://javaboutique.internet.com/ | The Ultimate Java Applet Resource | (1) | 0.118081 |

###### Table 1: Authorities for WWW query "Java"

##### Single-Topic Query: movies

This query demonstrates the TKC effect in a most striking fashion on the WWW. First, consider the Mutual Reinforcement principal community of authorities, presented in table 2:

Size of root size = 175, Size of collection = 4539

url | title | cat | weight |

http://go.msn.com/npl/msnt.asp | MSN.COM | (3) | 0.167332 |

http://go.msn.com/bql/whitepages.asp | White Pages - msn.com | (3) | 0.167202 |

http://go.msn.com/bsl/webevents.asp | Web Events | (3) | 0.167202 |

http://go.msn.com/bql/maps.asp | Microsoft Expedia Maps-Home | (3) | 0.167202 |

###### Table 2:Mutual Reinforcement Authorities for WWW query "movies"

The top 30 authorities returned by the Mutual Reinforcement approach were all *go.msn.com* sites. All but the first received the exact same weight, 0.167202. Recall that we do not allow same-domain links in our collection, hence none of the top authorities was pointed at by a *go.msn.com* site. To understand how these sites scored so well, we turn to the principal community of hubs, shown in table 3:

url | title | cat | weight |

http://denver.sidewalk.com/movies | movies: denver.sidewalk | (1) | 0.169197 |

http://boston.sidewalk.com/movies | movies:boston.sidewalk | (1) | 0.169061 |

http://twincities.sidewalk.com/movies | movies: twincities.sidewalk | (1) | 0.1688 |

http://newyork.sidewalk.com/movies | movies: newyork.sidewalk | (1) | 0.168537 |

###### Table 3: Mutual Reinforcement Hubs for WWW query "movies"

These innocent looking hubs are all part of the *Microsoft Network (msn)*, but when building the basic set we did not identify them as such. All these hubs point, almost without exception, to the entire set of authorities found by the MR approach (hence the equal weights which the authorities exhibit). However, the vast majority of the sites in the collection were not part of this "conspiracy", and almost never pointed to any of the *go.msn.com* sites. Therefore, the authorities returned by the Stochastic approach (table 4) contain none of those *go.msn.com* sites, and are much more relevant to the query:

url | title | cat | weight |

http://us.imdb.com/ | The Internet Movie Database | (3) | 0.253333 |

http://www.mrshowbiz.com/ | Mr Showbiz | (3) | 0.22335 |

http://www.disney.com/ | Disney.com-The Web Site for Families | (3) | 0.22003 |

http://www.hollywood.com/ | Hollywood Online:...all about movies | (3) | 0.213355 |

http://www.imdb.com/ | The Internet Movie Database | (3) | 0.199987 |

http://www.paramount.com/ | Welcome to Paramount Pictures | (3) | 0.196682 |

http://www.mca.com/ | Universal Studios | (3) | 0.180021 |

###### Table 4: Stochastic authorities for WWW query "movies"

A similar community is obtained by the Mutual Reinforcement approach, after deleting the rows and columns which correspond to the top 30 authorities from the matrix *W*^{T}*W*. This deletion dissolves the *msn.com* community, and allows a community similar to the one obtained by SALSA to manifest itself.

##### Multi-Topic Query: abortion

This topic is highly polarized, with different cyber communities supporting pro-life and pro-choice views. In table 5, we bring the top 10 authorities, as determined by the two approaches:

Size of root size = 160, Size of collection = 1693 Principal Community, Mutual Reinforcement Approach:

url | title | cat | weight |

http://www.nrlc.org/ | National Right To Life | (3) | 0.420832 |

http://www.prolife.org/ultimate/ | The Ultimate Pro-Life Resource List | (3) | 0.316564 |

http://www.all.org/ | What's new at American Life League | (3) | 0.251506 |

http://www.hli.org/ | Human Life International | (3) | 0.212931 |

http://www.prolife.org/cpcs-online/ | Crisis Pregnancy Centers Online | (3) | 0.187707 |

http://www.ohiolife.org/ | Ohio Right to Life | (3) | 0.182076 |

http://www.rtl.org/ | Abortion, adoption and assisted-suicide
Information at Right to Life... |
(1) | 0.17943 |

http://www.bethany.org/ | Bethany Christian Services | (3) | 0.161359 |

http://www.ldi.org/ | abortion malpractice litigation | (1) | 0.140076 |

http://www.serve.com/fem4life/ | Feminists for Life of America | (3) | 0.122106 |

Principal Community, SALSA:

url | title | cat | weight |

http://www.nrlc.org/ | National Right To Life | (3) | 0.344029 |

http://www.prolife.org/ultimate/ | The Ultimate Pro-Life Resource List | (3) | 0.284714 |

http://www.naral.org/ | NARAL Choice for America | (3) | 0.240227 |

http://www.feminist.org/ | Feminist Majority Foundation | (3) | 0.186843 |

http://www.now.org/ | National Organization for Women | (3) | 0.177946 |

http://www.cais.com/agm/main/ | The Abortion Rights Activist | (1) | 0.166083 |

http://www.gynpages.com/ | Abortion Clinics Online | (3) | 0.163117 |

http://www.plannedparenthood.org/ | Planned Parenthood Federation | (3) | 0.157186 |

http://www.all.org/ | What's new at American Life League | (3) | 0.142357 |

http://www.hli.org/ | Human Life International | (3) | 0.142357 |

###### Table 5: Authorities for WWW query "Abortion"

All 10 top authorities found by the Mutual Reinforcement approach are pro-life resources, while the top 10 SALSA authorities are split, with 6 pro-choice sites and 4 pro-life sites (which are the same top 4 pro-life sites found by the Mutual Reinforcement approach). Again, we see the TKC effect: The Mutual Reinforcement approach ranks highly authorities on only one aspect of the query, while SALSA blends authorities from both aspects into its principal community.

##### Multi-Topic Query: genetics

This query is especially ambiguous in the WWW: It can be in the context of genetic engineering, genetic algorithms, or in the context of health issues and the human genome.

As in the *"abortion"* query, SALSA brings a diverse principal community, with authorities on the various contexts of the query, while the Mutual Reinforcement approach is focussed on one context (Genetic Algorithms, in this case). Both principal communities are shown in table 6:

Size of root size = 120, Size of collection = 2952 Principal Community, Mutual Reinforcement Approach:

url | title | cat | weight |

http://www.aic.nrl.navy.mil/galist/ | The Genetic Algorithms Archive | (3) | 0.27848 |

http://alife.santafe.edu/ | Artificial Life Online | (3) | 0.276159 |

http://www.yahoo.com/ | Yahoo! | (3) | 0.273599 |

http://www.geneticprogramming.com/ | The Genetic Programming Notebook | (1) | 0.25588 |

http://gal4.ge.uiuc.edu/illigal.home.html | illiGAL Home Page | (3) | 0.235717 |

http://www.cs.gmu.edu/research/gag/ | The Genetic Algorithms Group... | (3) | 0.201237 |

http://www.scs.carleton.ca/~csgs/
resources/gaal.html |
Genetic Algorithms and Artificial Life Resources | (1) | 0.181315 |

http://lancet.mit.edu/ga/ | GAlib: Matthew's Genetic Algorithms Library | (3) | 0.181157 |

Principal Community, SALSA:

url | title | cat | weight |

http://www.ncbi.nlm.nih.gov/ | The National Center for Biotechnology Information |
(3) | 0.250012 |

http://www.yahoo.com/ | Yahoo! | (3) | 0.227782 |

http://www.aic.nrl.navy.mil/galist/ | The Genetic Algorithms Archive | (3) | 0.223191 |

http://www.nih.gov/ | National Institute of Health (NIH) | (3) | 0.194688 |

http://gdbwww.gdb.org/ | The Genome Database | (3) | 0.177001 |

http://alife.santafe.edu/ | Artificial Life Online | (3) | 0.172383 |

http://www.genengnews.com/ | Genetic Engineering News (GEN) | (1) | 0.141617 |

http://gal4.ge.uiuc.edu/illigal.home.html | illiGAL Home Page | (3) | 0.13259 |

###### Table 6: Authorities for WWW query "genetic"

### 6. SALSA and the In/Out Degrees of Sites

In the previous sections we have presented the Stochastic approach as an alternative method for link-structure analysis, and have shown a few encouraging results obtained by it, as compared with the Mutual Reinforcement approach. We have also presented the TKC effect, a topological phenomenon which sometimes derails the MR approach and prevents it from converging to a useful community of authoritative sites.

The sample results shown so far have all been produced on unweighted collections, in which all informative links have received unit weight. Both approaches can produce better rankings when applied on weighted collections, in which each informative link receives a weight which reflects the amount of authority that the pointing site confers to the pointed site. Possible factors which may contribute to a link's weight include the following:

- Anchor text which is relevant to the query. Such text around a link heightens our confidence that the pointed site discusses the topic at hand ([7]).
- One of the link's endpoints being designated by the user as highly relevant to the search topic. When a site points to one of a small set of predefined authorities, it seems reasonable to raise the weights of other links which originate from that site. Similarly, when a site is known to be a good hub, it seems reasonable to assign high weights to its outgoing links. This approach has been recently applied in [5]. We coin it the
*anchor sites*approach, since it uses user-designated sites as anchors in the collection, around which the communities of hubs and authorities are grown. - The link's placement in the pointing page. Many search engines consider the text at the top of a page as more reflective of its contents than text further down the page. The same line of thought can be applied to the links which appear in a page, with the links which are closer to the top of the page receiving more weight than links appearing at the bottom of the page.

#### 6.1 Analysis of the Stochastic Ranking

We now prove a general result about the ranking produced by SALSA in weighted collections, for which some basic background in stochastic processes is assumed.

Let *G*=(*H*;*A*;*E*) be a positively weighted, directed bipartite graph with no isolated nodes, and let all edges be directed from sites in *H* to sites in *A*. We will use the following notations:

- The weighted in-degree of site
*i**A*: - The weighted out-degree of site
*k**H*: - The sum of edge weights:

Let *M*_{A} be a Markov chain whose states are the set *A* of vertices, with the following transition probabilities between every two states *i , j**A*:

Similarly, let *M*_{H} be a Markov chain whose states are the set *H* of vertices, with the following transition probabilities between every two states *k , l**H*:

Consider the following binary relation on the vertices of *A* (states of *M*_{A}):

It is not hard to show (and is shown in the full paper) that *R _{A}* is an equivalence relation on

*A*(similar arguments can be made concerning

*M*). This implies that all the states of

_{H}*M*are recurrent (none are transient). The equivalence classes of

_{A}*R*are the irreducible components of

_{A}*M*. We first deal with the case where

_{A}*R*consists of one equivalence class (i.e.,

_{A}*M*is irreducible).

_{A}**Proposition 1** Whenever *M _{A}* is an irreducible chain (has a single irreducible component), it has a unique stationary distribution

Similarly, whenever *M _{H}* is an irreducible chain, its unique stationary distribution ) satisfies:

*Proof:* We will prove the proposition for *M _{A}*. The proof for

*M*is similar.

_{H}By the Ergodic Theorem ([9]), any irreducible, aperiodic Markov chain has a unique stationary distribution vector. It will therefore suffice to show that the vector with the properties claimed in the proposition is indeed a stationary distribution vector of *M _{A}*.

- is a distribution vector: Its entries are non-negative, and their sum equals one.
- is a stationary distribution vector of
*M*. Here we need to show the equality :_{A}

Thus, when the (undirected) support graph of *G* is connected, SALSA assigns each site an authority weight which is proportional to the sum of weights of its incoming edges. The hub weight of each site is proportional to the sum of weights of its outgoing edges. In unweighted collections (with all edges having unit weight), each site's Stochastic authority(hub) weight is simply proportional to the in(out) degree of the site.

This mathematical analysis, in addition to providing insight about the ranking that is produced by SALSA, also suggests a very simple algorithm for calculating the Stochastic ranking: Simply calculate, for all sites, the sum of weights on their incoming(outgoing) edges, and normalize these two vectors. There is no need to apply any resource-consuming iterative method to approximate the principal eigenvector of the transition matrix of the Markov chain.

##### Markov chains with multiple irreducible components

Consider the case in which the authority chain *M _{A}* consists of multiple irreducible components. Denote these (pairwise disjoint) components by , where . What will be the outcome of a random walk performed on the set of states

*A*according to the transition matrix

*P*? To answer this question, we will need some notations:

_{A}- Let
*e*denote the |*A*|-dimensional distribution vector, all whose entries equal . - For all vertices
*j**A*, denote by*c*(*j*) the irreducible component (equivalence class of*R*) to which_{A}*j*belongs: . - Let be the unique stationary distributions of the (irreducible) Markov chains induced by .
- Denote by the entry which corresponds to
*j*in (the stationary distribution of*j*'s irreducible component,*A*)._{c(j)}

**Proposition 2** The random walk on *A*, governed by the transition matrix *P _{A}* and started from all states with equal probability, will converge to a stationary distribution as follows:

*Proof:* Denote by the probability of being in a site belonging to *A _{i}* after the

*n'th*step of the random walk. This probability is determined by the distribution vector

*eP*. Clearly,

_{A}^{n}Since the transition probability between any two sites (states) which belong to different irreducible components is zero, *p _{i}^{n}*=

*p*for all

_{i}^{0}*n*(probability does not shift from one component to another). Inside each irreducible component the Ergodic Theorem holds, thus the probabilities which correspond to the sites of

*A*in will be proportional to , and the proposition follows.

_{i}This proposition points out a natural way to compare the authoritativeness of sites from different irreducible components: Simply multiply each site's authority score by the normalized size of the irreducible component to which it belongs. We do not claim that this is in any way optimal, as very small irreducible components should be trimmed from the graph altogether. But the underlying principle is important: Consider the size of the community when evaluating the quality of the top sites in that community. The budget which the Mayor of New York City controls is much larger than that of the Mayor of Osh Kosh, Wisconsin.

It is this combination of a site's intra-community authority score and its community's size that allows the Stochastic approach to blend authorities from different aspects of a multi-topic query, and which reduces its vulnerability to the TKC effect.

#### 6.2 In-Degree as a Measure of Authority (Revisited)

Extensive research in link-structure analysis has been conducted in recent years under the premise that considering the in-degree of sites as a sole measure of their authority does not produce satisfying results. Kleinberg, as a motivation to the Mutual Reinforcement approach, showed some examples of the inadequacy of a simple in-degree ranking ([16]). Our results in section 5.2 seem to contradict this premise: The Stochastic rankings seem quite satisfactory there, and since those collections were unweighted, the Stochastic rankings are equivalent to simple in-degree counts (normalized by the size of the connected component which each site belongs to). To gain more perspective on this apparent contradiction, let us elaborate on the first stage of the meta-algorithm for link-structure analysis (from section 3), in which the graph to be analyzed is assembled:

- Given a query, assemble a collection of Web-sites which should contain many hubs and authorities pertaining to the query, and few hubs and authorities for any particular unrelated topic.
- Filter out non-informative links connecting sites in the collection.
- Assign weights to all non-filtered links. These weights should reflect the information conveyed by the link.

It is only after these steps that the weighted, directed graph is analyzed and the rankings of hubs and authorities are produced. The analysis of the graph, however important, is just the second stage in the meta-algorithm, and the steps involved in the first stage are crucial to the success of the entire algorithm.

Considerable research efforts have been invested in improving the quality of the assembled graphs. The current state of the art techniques for these steps is now such that in many cases, simple (and efficient) algorithms and heuristics produce quite satisfying results on the assembled graphs.

It is important to keep in mind the main goal of broad-topic WWW searches, which is to enhance the precision at 10 of the results, not to rank the entire collection of sites correctly. It is entirely irrelevant if the site in place 98 is really better than the site in place 216. The Stochastic ranking, which turns out to be equivalent to a weighted in-degree ranking, discovers the most authoritative sites quite effectively (and very efficiently) in many (carefully assembled) collections. No claim is made on the quality of its ranking on the rest of the sites (which constitute the vast majority of the collection).

### 7. Conclusions

We have developed a new approach for finding hubs and authorities, which we call *SALSA* - The Stochastic Approach for Link Structure Analysis. SALSA examines random walks on two different Markov chains which are derived from the link structure of the WWW: The authority chain and the hub chain. The principal community of authorities (hubs) corresponds to the sites that are most frequently visited by the random walk defined by the authority (hub) Markov chain. SALSA and Kleinberg's Mutual Reinforcement approach are both in the framework of the same meta-algorithm.

We have shown that the ranking produced by SALSA is equivalent to a weighted in/out-degree ranking (with the sizes of irreducible components also playing a part). This makes SALSA computationally lighter than the Mutual Reinforcement approach.

Both approaches were tested on the WWW, where SALSA appears to compare well with the Mutual Reinforcement approach. These tests, as well as analytical work, have revealed a topological phenomenon on the Web called the TKC effect. This effect sometimes derails the Mutual Reinforcement approach, and prevents it from finding relevant authoritative sites (or from finding authorities on all meanings/aspects of the query):

- In multi-topic collections, the principal community of authorities found by the Mutual Reinforcement approach tends to pertain to only one of the topics in the collection.
- In single topic collections, the TKC effect sometimes results in the Mutual Reinforcement approach ranking many irrelevant sites as authorities.

We note that SALSA is less vulnerable to the TKC effect, and produces good results in many cases where the Mutual Reinforcement approach fails to do so.

##### The following issues are left for future research:

- In collections with many connected components, we have studied one manner in which to combine the inner-component authority score with the size of the component. There may be better ways to combine these two factors into a single score.
- We have found a simple property of the Stochastic ranking, which enables us to compute this ranking without the need to approximate the principal eigenvector of the stochastic matrix which defines the random walk. Is there some simple property which will allow us to calculate the Mutual Reinforcement ranking without approximating the principal eigenvector of
*W*^{T}*W*? If not, can we alter the graph*G*in some simple manner (for instance, change some weights on the edges) so that the Stochastic ranking on the modified graph will be approximately equal to the Mutual Reinforcement ranking on the original graph?

### Acknowledgments

The second author would like to thank Udi Manber for introducing him to the search problems studied in this paper, and Udi Manber and Toni Pitassi for delightful and interesting discussions at the early stages of this research.

### Bibliography

- J. Gary Auguston and Jack Minker. An analysis of some graph theoretical cluster techniques. JACM, 17:4(1970) 571-588 .
- Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proc. 7th International WWW Conference, 1998.
- Jeromy Carrière and Rick Kazman. Webquery: Searching and visualizing the web through connectivity. Proc. 6th International WWW Conference, 1997.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Hypersearching the web. Scientific American, June 1999.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the web's link structure. IEEE Computer, August 1999.
- S. Chakrabarti, B. Dom, D. Gibson, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. ACM SIGIR workshop on Hypertext Information Retrieval on the Web, 1998.
- Soumen Chakrabarti, Byron Dom, David Gibson, Jon M. Kleinberg, Prabhakar Raghavan, and Sridhar Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. Proc. 7th International WWW Conference, 1998.
- Compaq Computer Corporation. Altavista net guide.http://www.altavista.com/.
- Robert G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, 1996.
- E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178(1972) 471-479 .
- David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
- Google Inc. Google search engine. http://www.google.com/.
- Monika R. Henzinger and Krishna Bharat. Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the 21'st International ACM SIGIR Conference on Research and Development in IR, August 1998.
- IBM Corporation Almaden Research Center. Clever. http://www.almaden.ibm.com/cs/k53/clever.html.
- M.M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14(1963) 10-25 .
- Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
- Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S. Tomkins. The web as a graph: Measurements, models and methods. Proceedings of the Fifth International Computing and Combinatorics Conference, 1999.
- Ken Law, Thomas Tong, and Alan Wong. Automatic categorization based on link structure, Stanford University, 1999.
- Massimo Marchiori. The quest for correct information on the web: Hyper search engines. Proc. 6th International WWW Conference, 1997.
- Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Preliminary version appeared in PODS 98, pp. 159-168 .
- Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the web. Proc. ACM SIGCHI Conference on Human Factors in Computing, 1996.
- C.J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
- H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. J. American Soc. Info. Sci., 24(1973) 265-269 .
- R. Weiss, B. Vélez, M. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. Gifford. Hypursuit: A hierarchical network search engine that exploits content-link hypertext clustering. Proc. 7th ACM Conference on Hypertext, 1996.

#### Vitae

- Ronny Lempel is a Ph.D. Student in the Department of Computer Science, Technion, Haifa, Israel, focusing on WWW link-structure analysis. He received his B.Sc. and M.Sc. from the same department in 1997 and 1999, respectively.
- Shlomo Moran received his BSc and DSc degrees in mathematics from the Technion, in 1975 and 1979 resp. Since 1981 he is a faculty member in the Computer Science Department in the Technion, where he is now a Professor and Chairman. His current research interests include communication in high speed networks, exact communication complexity of distributed tasks, confidentiality protection in medical records, and search methods in the internet.

### About this document ...

**The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect**

This document was generated using the **LaTeX**2`HTML` translator Version 98.1p7 (June 18th, 1998)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

Copyright © 1997, 1998, Ross Moore, Mathematics Department, Macquarie University, Sydney.