Current HITS-based Algorithms and their Limitations nextupprevious
Next:A New Weighted HITS-basedUp:Improvement of HITS-based AlgorithmsPrevious:Introduction

Current HITS-based Algorithms and their Limitations

In the context of Web search, a HITS-based algorithm first collects a base document set for each query. Then it recursively calculates the hub and authority values for each document. To gather the base document set $I$, first, a root set $R$ that matching the query is fetched from a search engine; then, for each document $r \in R$, a set of documents $L$ that point to $r$ and another set of documents $L'$ that are pointed to by$r$ are added to the set $I$ as $R$'s neighborhood. For a document $i \in I$, let $a_i$ and $h_i$ be the authority and hub values respectively. To begin the algorithm, $a_i$ and $h_i$ are initialized to 1. While the values have not converged, the algorithm iteratively proceeds as follows:
  1. For all $i' \in I$ which points to $i$,
  2.  
    \begin{displaymath}a_i = \sum_{i'} h_{i'}\end{displaymath} (1)
  3. For all $i' \in I$ which is pointed to by $i$,
  4.  
    \begin{displaymath}h_i = \sum_{i'} a_{i'}\end{displaymath} (2)
  5. Normalize $a_i$ and $h_i$ values so that $\sum_{i} a_i=\sum_{i} h_i =1$.
Kleinberg showed that the algorithm will eventually converge, but the bound on the number of iterations is unknown. In practice, the algorithm converges quickly.

Bharat [1] improved Kleinberg's HITS algorithm by giving a document an authority weight of $1/k$ if the document is in a group of $k$ documents on a first host which link to a single document on a second host, and a hub weight of $1/l$ if there are $l$ links from the document on a first host to a set of documents on a second host. In the following discussion, we denote Bharat's improved HITS algorithm as BHITS algorithm.
 
 

Table 1: Top 10 authorities and top 10 hubs obtained by BHITS algorithm and WBHITS algorithm for query cruises, respectively. We ignore the prefix ``http://'' for each link.
Top 10 Auth. by BHITS Top 10 hubs by BHITS
www.cyprus-property.net www.cyprus-cruises.com
www.cyprus-hotels.com cyprus-car-hire.com
www.windowoncyprus.com cruisecyprus.com
www.cyprus-villas.com www.sunnycyprus.com/links/transport.html
cruisecyprus.com www.armata.net/banner_advertising_opportunities.htm
www.fly-2-cyprus.com www.caribbean-media.com/amex_why.htm
cyprus-car-hire.com www.cyprus-hotels.com/ayia_napa.htm
www.cyplon.com www.cyprus-holidays.com/cyprus.htm
www.benacon.com windowsoncyprus.com/contacts_for_window_on_cyprus.htm
www.yacht-sale.com sybar.com/d/Autos/autos.html
Top 10 Auth. by WBHITS Top 10 Hubs by WBHITS
www.costacruises.com www.wxusa.com/Travel
www.rssc.com www.cruiseplanners-acruz4u.com/cruises.html
www.cruisehawaii.com www.cruisesonly.net/CLIAlinks.htm
www.regalcruises.com www.anchorsaweigh.com/destinations.html
www.princesscruises.com stablescruise.cruisesonly.net/CLIAlinks.htm
www.first-european.com www.cruisespecials.com/newlinkspage.htm
www.silversea.com www.caribbean-on-line.com/cruise-lines
www.princess.com www.justcruising.org/cruise_lines.htm
www.royalolympiccruises.com www.all-global.com/cruiselineslinks.htm
www.rivercruises.com www.tradewinds-travel.net/cruises.htm

Although BHITS generally works well, there are cases where it generates bad results. Let's use a simple query cruises to illustrate the problem. The top half of Table 1 shows the top 10 authorities and top 10 hubs returned by BHITS. All the top 10 authorities are from the out-links of a root link www.cyprus-cruises.com, which has only 1 in-link but 271 out-links, most of which are in different domains, while the average in-degree and out-degree of a root set link are 34 and 17 respectively. Top 3 to top 10 hubs are from another root link cyprus-car-hire.com, which is also an out-link of the root link www.cyprus-cruises.com. Actually, the two root links and their neighborhood form a tightly-knit community (TKC) [18]. After several iterations of BHITS algorithm, the hub value of the root link www.cyprus-cruises.com dominates all the others, as a result, the documents it points to get the largest authority values. Consequently, the in-links of root link cyprus-car-hire.com get the top hub values because it is assigned the largest authority value. After we manually visited these top 10 authorities and top 10 hubs, we found most of the links are of very low relevance to the query cruises.

The major problem of the above example is that the root link www.cyprus-cruises.com has few in-links but a large number of out-links, most of which are not very relevant to the query. Let's call such a root link small-in-large-out link. Usually, the average in-degree and out-degree of a root link are much smaller than the out-degree of a small-in-large-out link.


nextupprevious
Next:A New Weighted HITS-basedUp:Improvement of HITS-based AlgorithmsPrevious:Introduction
2002-02-18