Next:A New Weighted HITS-basedUp:Improvement of HITS-based AlgorithmsPrevious:Introduction
Current HITS-based Algorithms and their LimitationsIn the context of Web search, a HITS-based algorithm first collects a base document set for each query. Then it recursively calculates the hub and authority values for each document. To gather the base document set , first, a root set that matching the query is fetched from a search engine; then, for each document , a set of documents that point to and another set of documents that are pointed to by are added to the set as 's neighborhood. For a document , let and be the authority and hub values respectively. To begin the algorithm, and are initialized to 1. While the values have not converged, the algorithm iteratively proceeds as follows:
- For all which points to ,
- For all which is pointed to by ,
- Normalize and values so that .
Bharat  improved Kleinberg's HITS
algorithm by giving a document an authority weight of
if the document is in a group of
documents on a first host which link to a single document on a second host,
and a hub weight of
if there are
links from the document on a first host to a set of documents on a second
host. In the following discussion, we denote Bharat's improved HITS algorithm
as BHITS algorithm.
|Top 10 Auth. by BHITS||Top 10 hubs by BHITS|
|Top 10 Auth. by WBHITS||Top 10 Hubs by WBHITS|
Although BHITS generally works well, there are cases where it generates bad results. Let's use a simple query cruises to illustrate the problem. The top half of Table 1 shows the top 10 authorities and top 10 hubs returned by BHITS. All the top 10 authorities are from the out-links of a root link www.cyprus-cruises.com, which has only 1 in-link but 271 out-links, most of which are in different domains, while the average in-degree and out-degree of a root set link are 34 and 17 respectively. Top 3 to top 10 hubs are from another root link cyprus-car-hire.com, which is also an out-link of the root link www.cyprus-cruises.com. Actually, the two root links and their neighborhood form a tightly-knit community (TKC) . After several iterations of BHITS algorithm, the hub value of the root link www.cyprus-cruises.com dominates all the others, as a result, the documents it points to get the largest authority values. Consequently, the in-links of root link cyprus-car-hire.com get the top hub values because it is assigned the largest authority value. After we manually visited these top 10 authorities and top 10 hubs, we found most of the links are of very low relevance to the query cruises.
The major problem of the above example is that the root link www.cyprus-cruises.com has few in-links but a large number of out-links, most of which are not very relevant to the query. Let's call such a root link small-in-large-out link. Usually, the average in-degree and out-degree of a root link are much smaller than the out-degree of a small-in-large-out link.
Next:A New Weighted HITS-basedUp:Improvement of HITS-based AlgorithmsPrevious:Introduction 2002-02-18