Web Page Scoring Systems
for Horizontal and Vertical Search
Dipartimento di Ingegneria dell'Informazione
Via Roma 56 - Siena, Italy
Copyright is held by the author/owner(s).
WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA.
In this paper, we propose a general framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. Experimental results are given to assess the features of the proposed scoring systems, with special emphasis on vertical search.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - Sorting and Searching; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia
Keywords: Web Page Scoring Systems, Random Walks, HITS, PageRank, Focused PageRank

The analysis of the Web link topology [1] can significantly increase the capability of search engines. A simple counting of the number of references does not take into account the fact that not all the citations have the same authority. PageRank [2] is a notable example of a topological-based ranking criterion. An interesting example of query-dependent criteria is given in [3]: user queries are issued to a search engine in order to create a set of seed pages, and the Web is crawled forward and backward from that seed to mirror the Web portion containing the information which is likely to be useful. A ranking criterion based on topological analyses can then be applied to the pages belonging to the selected Web portion. Very interesting results in this direction have been proposed in [4,5,6]. In [7] a Bayesian approach is used to compute hubs and authorities, whereas in [8] both topological information and information about page content are included in the distillation of information sources, performed by a Bayesian approach.
Generally speaking, the ranking of hypertextual documents is expected to take into account the reputation of the source, the page updating frequency, the popularity, the access speed, the degree of authority, and the degree of hubness.
The page rank of hypertextual documents can be thought of as a function of the document content and of the hyperlinks. In this paper, we propose a general framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. The general web page scoring model proposed in this paper extends both PageRank [2] and the HITS scheme [3]. In addition, the proposed model exhibits a number of novel features, which turn out to be very useful especially for focused (vertical) search. The content of the pages is combined with the web graphical structure, giving rise to scoring mechanisms which are focused on a specific topic. Moreover, in the proposed model, vertical search schemes can take into account the mutual relationship amongst different topics. In so doing, the discovery of pages with high score for a given topic affects the score of pages with related topics.
Experiments were carried out to assess the features of the proposed scoring systems, with special emphasis on vertical search. The very promising experimental results reported in the paper provide a clear validation of the proposed general scheme for web page scoring systems.
Scoring systems based on random walks have been proposed in the literature [2,6]. The Web is represented as a graph $G$, where each Web page is a node and a link between two nodes represents a hyperlink between the associated pages. A common assumption is that the relevance of a page $p$ is represented by the probability of ending up in that page during a walk on this graph.
In the general framework we propose, we consider a complete model of the behavior of a user surfing the Web. We assume that a Web surfer can perform one out of four atomic actions at each step of his/her traversal of the Web graph:
- jump to a node of the graph;
- follow a hyperlink from the current page;
- follow a back-link (a hyperlink in the inverse direction);
- stay in the same node.
We assume that the behavior of the surfer depends on the page he/she is currently visiting: the action he/she decides to take depends on the page contents and on the links it contains. For example, if the current page is interesting to the surfer, it is likely that he/she will follow a hyperlink contained in the page, whereas, if the page is not interesting, the surfer will likely jump to a different page not linked by the current one. We can model the user behavior by a set of probabilities which depend on the current page $q$:
- $x(l|q)$: the probability of following one hyperlink from page $q$;
- $x(b|q)$: the probability of following one back-link from page $q$;
- $x(j|q)$: the probability of jumping from page $q$;
- $x(s|q)$: the probability of remaining in page $q$.
For each page $q$, these values must satisfy the normalization constraint $x(l|q) + x(b|q) + x(j|q) + x(s|q) = 1$.
Most of these actions need the specification of their target. Assuming that the surfer's behavior is time-invariant, we can model the targets for jumps, hyperlink, or back-link choices by using the following parameters:
- $x(p|q,j)$: the probability of jumping from page $q$ to page $p$;
- $x(p|q,l)$: the probability of selecting the hyperlink from page $q$ to page $p$; this value is non-null only for the pages $p$ linked directly by page $q$, i.e. $p \in ch(q)$, being $ch(q)$ the set of the children of node $q$ in the graph $G$;
- $x(p|q,b)$: the probability of going back from page $q$ to page $p$; this value is non-null only for the pages $p$ which link directly page $q$, i.e. $p \in pa(q)$, being $pa(q)$ the set of the parents of node $q$ in the graph $G$.
These sets of values must satisfy the following probability normalization constraints for each page $q \in G$:

$$\sum_{p \in G} x(p|q,j) = 1, \qquad \sum_{p \in ch(q)} x(p|q,l) = 1, \qquad \sum_{p \in pa(q)} x(p|q,b) = 1. \qquad (1)$$
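To make the surfer model concrete, the following sketch (ours, not the authors' code) simulates a single trajectory by sampling the four atomic actions on a toy graph; all graph edges and probability values are illustrative assumptions satisfying the constraints above.

```python
# Minimal sketch of one surfer trajectory: sample an action, then a target.
import random

children = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}   # hypothetical links q -> p
parents = {0: [2, 3], 1: [0], 2: [0, 1], 3: [2]}    # inverse of `children`
pages = list(children)

def step(q: int) -> int:
    """Sample the next page visited from page q."""
    # x(l|q), x(b|q), x(j|q), x(s|q): toy page-independent action probabilities
    action = random.choices(["link", "back", "jump", "stay"],
                            weights=[0.6, 0.2, 0.1, 0.1])[0]
    if action == "link" and children[q]:
        return random.choice(children[q])   # uniform x(p|q,l) over children
    if action == "back" and parents[q]:
        return random.choice(parents[q])    # uniform x(p|q,b) over parents
    if action == "stay":
        return q
    return random.choice(pages)             # uniform jump target x(p|q,j)

page = 0
for _ in range(10):
    page = step(page)
```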
The model considers a temporal sequence of actions performed by the surfer and it can be used to compute the probability $x_p(t)$ that the surfer is located in page $p$ at time $t$. The probability distribution on the pages of the Web is updated by taking into account the possible actions at time $t$ using the following equation:

$$x_p(t+1) = \sum_{q \in G} x(p|q,j)\,x(j|q)\,x_q(t) + \sum_{q \in pa(p)} x(p|q,l)\,x(l|q)\,x_q(t) + \sum_{q \in ch(p)} x(p|q,b)\,x(b|q)\,x_q(t) + x(s|p)\,x_p(t). \qquad (2)$$
These probabilities can be collected in an $N$-dimensional vector $\mathbf{x}(t)$, being $N$ the number of pages in the Web graph $G$, and the probability update equations can be rewritten in matrix form. The probabilities of moving from a page $q$ to a page $p$ given an action can be organized into the following matrices:
- the forward matrix $\Delta$, whose element $(p,q)$ is the probability $x(p|q,l)$;
- the backward matrix $\Gamma$, collecting the probabilities $x(p|q,b)$;
- the jump matrix $\Sigma$, which is defined by the jump probabilities $x(p|q,j)$.
Further, we can define the set of action matrices, which collect the probabilities of taking one of the possible actions from a given page $q$. These are diagonal matrices defined as follows: $D_j$, whose diagonal values are the probabilities $x(j|q)$; $D_l$, collecting the probabilities $x(l|q)$; $D_b$, containing the values $x(b|q)$; and $D_s$, having on the diagonal the probabilities $x(s|q)$.
Hence, equation (2) can be written in matrix form as

$$\mathbf{x}(t+1) = (\Sigma \cdot D_j + \Delta \cdot D_l + \Gamma \cdot D_b + D_s)\,\mathbf{x}(t). \qquad (3)$$

The transition matrix used to update the probability distribution is

$$T = \Sigma \cdot D_j + \Delta \cdot D_l + \Gamma \cdot D_b + D_s, \qquad (4)$$

so that equation (3) can be written as

$$\mathbf{x}(t+1) = T\,\mathbf{x}(t). \qquad (5)$$

Equation (5) can be applied recursively to compute the probability distribution at a given time step $t$, yielding

$$\mathbf{x}(t) = T^{t}\,\mathbf{x}(0). \qquad (6)$$
In order to define an absolute page rank for the pages on the Web, we consider the stationary distribution of the Markov chain defined by the previous equations, whose state transition matrix is $T$. The computation is stable, since $T$ is a stochastic matrix having its maximum eigenvalue equal to $1$. Since the state vector evolves following the equation of a Markov chain, it is guaranteed that if $\|\mathbf{x}(0)\|_1 = 1$, then $\|\mathbf{x}(t)\|_1 = 1$ for every $t$.
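As an illustration, the following sketch (a toy example under assumed parameter values, not the authors' code) builds the transition matrix $T$ of equation (4) for a four-page graph, checks that it is column-stochastic, and iterates equation (5) to approach the stationary distribution.

```python
import numpy as np

N = 4
links = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]   # hypothetical q -> p links

Sigma = np.full((N, N), 1.0 / N)       # x(p|q,j): uniform jump targets
Delta = np.zeros((N, N))               # x(p|q,l): hyperlink targets
Gamma = np.zeros((N, N))               # x(p|q,b): back-link targets
for q, p in links:
    Delta[p, q] = 1.0
    Gamma[q, p] = 1.0
Delta /= Delta.sum(axis=0)             # uniform over the children of q
Gamma /= Gamma.sum(axis=0)             # uniform over the parents of q

# Diagonal action matrices; for every page the four probabilities sum to 1.
D_j, D_l, D_b, D_s = (np.diag(np.full(N, v)) for v in (0.1, 0.6, 0.2, 0.1))

T = Sigma @ D_j + Delta @ D_l + Gamma @ D_b + D_s   # equation (4)
assert np.allclose(T.sum(axis=0), 1.0)              # T is column-stochastic

x = np.full(N, 1.0 / N)                # x(0), any distribution with ||x||_1 = 1
for _ in range(1000):
    x = T @ x                          # equation (5); x converges to x*
```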
By applying the results on Markov chains (see e.g. [9]), we can prove the following proposition.

Proposition 2.1. If $x(j|q) > 0$ and $x(p|q,j) > 0$ for all pages $p, q \in G$, then $\lim_{t \to \infty} \mathbf{x}(t) = \mathbf{x}^{\star}$ exists and does not depend on the initial distribution $\mathbf{x}(0)$.
Suppose that the probability distribution of the jump target does not depend on the page the surfer is leaving, i.e. $x(p|q,j) = x(p|j)$ for all $q \in G$. Moreover, we suppose that also the probability of choosing a jump among the available actions does not depend on the page, i.e. $x(j|q) = x(j)$ for all $q \in G$. Under these two assumptions, equation (3) becomes

$$\mathbf{x}(t+1) = x(j)\,\boldsymbol{\sigma} + (\Delta \cdot D_l + \Gamma \cdot D_b + D_s)\,\mathbf{x}(t), \qquad (7)$$

where $\boldsymbol{\sigma}$ is the vector whose $p$-th component is $x(p|j)$, since $\Sigma \cdot D_j \cdot \mathbf{x}(t) = x(j)\,\boldsymbol{\sigma}$ because $\|\mathbf{x}(t)\|_1 = 1$.
If we define $R = \Delta \cdot D_l + \Gamma \cdot D_b + D_s$, the solution of equation (7) is

$$\mathbf{x}(t) = R^{t}\,\mathbf{x}(0) + x(j)\left(\sum_{\tau=0}^{t-1} R^{\tau}\right)\boldsymbol{\sigma}. \qquad (8)$$

It holds that

$$\|R\|_1 \le 1 - x(j) < 1, \qquad (9)$$

because each column of $R$ sums at most to $1 - x(j)$ by the normalization of the action probabilities. Hence, in equation (8) the first term vanishes as $t \to \infty$, i.e.

$$\lim_{t \to \infty} R^{t}\,\mathbf{x}(0) = \mathbf{0}, \qquad (10)$$

and the geometric series in the second term converges,

$$\lim_{t \to \infty} \sum_{\tau=0}^{t-1} R^{\tau} = (I - R)^{-1}. \qquad (11)$$

Using equations (9)-(11), it can be proven that the previous equation converges to

$$\mathbf{x}^{\star} = \lim_{t \to \infty} \mathbf{x}(t) = x(j)\,(I - R)^{-1}\,\boldsymbol{\sigma}. \qquad (12)$$

Consistently with Proposition 2.1, the value of $\mathbf{x}^{\star}$ does not depend on the choice of the initial state vector $\mathbf{x}(0)$.
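The closed form (12) can be checked numerically. In the sketch below (ours, not the paper's code), $R$ is an arbitrary nonnegative matrix whose columns sum to $1 - x(j)$, consistent with the bound (9); the direct solve and the iteration of equation (7) must agree.

```python
import numpy as np

N, x_j = 4, 0.10
sigma = np.full(N, 1.0 / N)                  # jump-target distribution x(p|j)
rng = np.random.default_rng(0)
R = rng.random((N, N))
R *= (1.0 - x_j) / R.sum(axis=0)             # columns sum to 1 - x(j), as in (9)

x_star = x_j * np.linalg.solve(np.eye(N) - R, sigma)   # equation (12)

x = np.full(N, 1.0 / N)                      # any initial distribution x(0)
for _ in range(2000):
    x = x_j * sigma + R @ x                  # iterate equation (7)
assert np.allclose(x, x_star, atol=1e-10)    # the iteration reaches x*
```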
Scoring systems based on multiple state variables have been proposed in the literature [3,6], where a pair of variables is used to represent the concepts of hubness and authority of a page. In the probabilistic framework described so far, we can define a multivariable scheme by considering a pool of Web surfers, each described by a single variable. Each surfer is characterized by his/her own way of browsing the Web, modeled by using different parameter values in each state transition equation. By choosing proper values for these parameters we can implement different policies for evaluating the absolute importance of the pages. Moreover, the surfers may interact by accepting suggestions from each other.
In order to model the activity of $M$ different surfers, we use a set of state variables $x^{(i)}_p(t)$, $i = 1, \ldots, M$, which represent the probability of surfer $i$ to be visiting page $p$ at time $t$. The interaction among the surfers is modeled by a set of parameters which define the probability of the surfer $i$ to accept the suggestion of the surfer $k$, thus jumping from the page he/she is visiting to the one visited by the surfer $k$. This interaction happens before the choice of the actions described previously. If we hypothesize that the interaction does not depend on the page the surfer is currently visiting, the degree of interaction with the surfer $k$ is modeled by the value $b(i|k)$, which represents the probability for the surfer $i$ of jumping to the page visited by the surfer $k$. These values must satisfy the probability normalization constraint $\sum_{k=1}^{M} b(i|k) = 1$.
As an example, suppose that there are two surfers, ``the novice'' (surfer 1) and ``the expert'' (surfer 2). The novice blindly trusts the suggestions of the expert, as he/she believes surfer 2 is an expert in discovering authoritative pages, whereas he/she does not trust at all his/her own capabilities. In this case the complete dependence of the novice on the expert is modeled by choosing $b(1|2) = 1$ and $b(1|1) = 0$, i.e. the novice chooses the page visited by the expert with probability equal to 1.
Before taking any action, the surfer $i$ repositions himself/herself in page $p$ with probability $\bar{x}^{(i)}_p(t)$, looking at the suggestions of the other surfers. This probability is computed as

$$\bar{x}^{(i)}_p(t) = \sum_{k=1}^{M} b(i|k)\,x^{(k)}_p(t). \qquad (13)$$

Thus, when computing the new probability distribution $x^{(i)}_p(t+1)$ due to the action taken at time $t$ by the surfer $i$, we consider the distribution $\bar{x}^{(i)}_p(t)$ when applying equation (2). Thus, the transition function is defined as follows:

$$x^{(i)}_p(t+1) = \sum_{q \in G} x_i(p|q,j)\,x_i(j|q)\,\bar{x}^{(i)}_q(t) + \sum_{q \in pa(p)} x_i(p|q,l)\,x_i(l|q)\,\bar{x}^{(i)}_q(t) + \sum_{q \in ch(p)} x_i(p|q,b)\,x_i(b|q)\,\bar{x}^{(i)}_q(t) + x_i(s|p)\,\bar{x}^{(i)}_p(t). \qquad (14)$$
When considering $M$ surfers, the score vectors $\mathbf{x}^{(i)}(t)$ of each surfer can be collected as the columns of a matrix $X(t)$. Moreover, we define the $M \times M$ matrix $B$ which collects the values $b(i|k)$; the matrix $B$ will be referred to as the interaction matrix. Finally, each surfer $i$ is described by his/her own forward, backward and jump matrices $\Delta_i$, $\Gamma_i$, $\Sigma_i$ and by the action matrices $D_j^{(i)}$, $D_l^{(i)}$, $D_b^{(i)}$, $D_s^{(i)}$. Thus, the transition matrix for the Markov chain associated to the surfer $i$ is

$$T_i = \Sigma_i \cdot D_j^{(i)} + \Delta_i \cdot D_l^{(i)} + \Gamma_i \cdot D_b^{(i)} + D_s^{(i)}. \qquad (15)$$

Using these definitions, the set of interacting surfers can be described by rewriting equation (14) as a set of matrix equations as follows:

$$\mathbf{x}^{(i)}(t+1) = T_i\,X(t)\,\mathbf{b}_i, \qquad i = 1, \ldots, M, \qquad (16)$$

where $\mathbf{b}_i = [b(i|1), \ldots, b(i|M)]'$. When the surfers are independent of each other (i.e. $b(i|i) = 1$ and $b(i|k) = 0$ for $k \neq i$), the model reduces to $M$ independent models as described by equation (6).
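A compact sketch of the interacting-surfer update follows; the transition matrices and the interaction matrix $B$ below are toy assumptions of ours, while the two steps mirror the repositioning (13), the individual action (14), and the matrix form (16).

```python
import numpy as np

def multi_surfer_step(X, T_list, B):
    """X: N x M matrix, one score column per surfer; B[i, k] = b(i|k)."""
    X_next = np.empty_like(X)
    for i, T_i in enumerate(T_list):
        x_bar = X @ B[i]               # repositioning step, equation (13)
        X_next[:, i] = T_i @ x_bar     # surfer i's own action, equation (14)
    return X_next

N, M = 4, 2
rng = np.random.default_rng(0)
T_list = [rng.random((N, N)) for _ in range(M)]
T_list = [T / T.sum(axis=0) for T in T_list]   # make each T_i column-stochastic
B = np.array([[0.8, 0.2],                      # toy b(i|k): mild interaction,
              [0.3, 0.7]])                     # rows sum to 1
X = np.full((N, M), 1.0 / N)
for _ in range(500):
    X = multi_surfer_step(X, T_list, B)        # columns remain distributions
```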
Google's PageRank [2] can be derived from the general random walk model of equation (2) by choosing its parameters as follows. First, the probabilities of following a back-link and of remaining in any page are null for all the pages, i.e. $x(b|q) = 0$ and $x(s|q) = 0$, $\forall q \in G$. Then, the probability of performing a random jump is a constant, $x(j|q) = 1 - d$, for any page $q$, whereas the probability of following a hyperlink contained in the page is also a constant, i.e. $x(l|q) = d$. Given that a jump is taken, its target is selected using a uniform probability distribution over all the Web pages, i.e. $x(p|q,j) = 1/N$. Finally, the probability of following the hyperlink from $q$ to $p$ does not depend on the page $p$: in order to meet the normalization constraint, $x(p|q,l) = 1/h_q$, where $h_q = |ch(q)|$ is the number of links exiting from page $q$ (the page hubness). This assumption makes the surfer ``random'', since it defines a uniform probability distribution among all the outgoing links.
This last hypothesis cannot be met by pages which do not contain any links to other pages. A page with no out-links is called a sink page, since it behaves just like a score sink in the PageRank propagation scheme. In order to keep the probabilistic interpretation of PageRank, all sink nodes must be removed. The page rank of sinks is then computed from the page ranks of their parents.
Under all these assumptions, equation (2) can be rewritten as

$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \in pa(p)} \frac{x_q(t)}{h_q},$$

and, then, equation (16) becomes

$$\mathbf{x}(t+1) = \frac{1-d}{N}\,\mathbf{1} + d\,\Delta\,\mathbf{x}(t), \qquad (17)$$

where $\mathbf{1}$ is the $N$-dimensional vector whose components are all equal to $1$.
Since $x(j|q) = 1 - d > 0$ and $x(p|q,j) = 1/N > 0$ for each pair of pages $p, q$, equation (12) guarantees that Google's PageRank converges to a distribution of page scores which does not depend on the initial distribution.
In order to apply the Google's PageRank scheme without removing the sink nodes, we can introduce the following modification to the original equations. Since no links can be followed from a sink node $q$, $x(l|q)$ must be equal to $0$ and $x(j|q)$ equal to $1$. Thus, when there are sinks, $D_j$ is defined as

$$(D_j)_{q,q} = \begin{cases} 1 - d & \text{if } ch(q) \neq \emptyset \\ 1 & \text{otherwise,} \end{cases}$$

and, correspondingly, $(D_l)_{q,q} = d$ if $ch(q) \neq \emptyset$ and $0$ otherwise.
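The following sketch implements this PageRank special case, including the sink correction, on a toy link list (our illustration, not Google's or the authors' implementation).

```python
import numpy as np

def pagerank(links, N, d=0.85, iters=100):
    out_deg = np.zeros(N)
    for q, _ in links:
        out_deg[q] += 1
    x = np.full(N, 1.0 / N)
    for _ in range(iters):
        # homogeneous jump mass: 1 - d from ordinary pages, 1 from sinks
        jump = (1.0 - d) + d * x[out_deg == 0].sum()
        x_next = np.full(N, jump / N)
        for q, p in links:
            x_next[p] += d * x[q] / out_deg[q]   # forward term of (17)
        x = x_next
    return x

scores = pagerank([(0, 1), (1, 2), (2, 0), (2, 3)], N=4)   # page 3 is a sink
```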
The HITS algorithm was proposed in [3]. In this formulation the Web pages are divided into two page classes: pages which are information sources (authorities) and pages which link to information sources (hubs). The HITS algorithm assigns two numbers to each page $p$, the page authority $a_p$ and the page hubness $h_p$, in order to model the importance of the page. These values are computed by applying iteratively the following equations:

$$a_p(t+1) = \sum_{q \in pa(p)} h_q(t), \qquad h_p(t+1) = \sum_{q \in ch(p)} a_q(t).$$
The HITS ranking scheme can be represented in the general Web surfer framework, even if some of the assumptions violate the probabilistic interpretation. Since HITS uses two state variables, the hubness and the authority of a page, the corresponding random walk model is a multiple state scheme based on the activity of two surfers. Surfer 1 is associated to the hubness of pages, whereas surfer 2 is associated to the authority of pages. For both surfers the probabilities of remaining in the same page and of jumping to a random page are null. Surfer 1 never follows a link, i.e. $x_1(l|q) = 0$, whereas he/she always follows a back-link, i.e. $x_1(b|q) = 1$. Because of this, the HITS computation violates the probability normalization constraints, since $x_1(p|q,b) = 1$ for each $p \in pa(q)$, so that $\sum_{p \in pa(q)} x_1(p|q,b) \neq 1$ in general.
Surfer 2 has the opposite behavior with respect to surfer 1. He/she always follows a link, i.e. $x_2(l|q) = 1$, and he/she never follows a back-link, i.e. $x_2(b|q) = 0$. In this case the normalization constraint is violated for the values of $x_2(p|q,l)$, because $x_2(p|q,l) = 1$ for each $p \in ch(q)$.
Under these assumptions $D_b^{(1)} = D_l^{(2)} = I$, being $I$ the identity matrix, whereas $\Sigma_1$, $\Sigma_2$, $D_j^{(1)}$, $D_s^{(1)}$, $D_l^{(1)}$, $D_j^{(2)}$, $D_s^{(2)}$, $D_b^{(2)}$ are all equal to the null matrix. Moreover, $\Gamma_1 = W'$ and $\Delta_2 = W$, being $W$ the adjacency matrix of the Web graph, whose element $(p,q)$ is $1$ if and only if page $q$ links page $p$.
The interaction between the surfers is described by the matrix

$$B = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$

i.e. each surfer repositions on the page visited by the other one before choosing his/her action. In this case equation (16) becomes

$$\mathbf{x}^{(1)}(t+1) = W'\,\mathbf{x}^{(2)}(t), \qquad \mathbf{x}^{(2)}(t+1) = W\,\mathbf{x}^{(1)}(t),$$

where $\mathbf{x}^{(1)}(t)$ and $\mathbf{x}^{(2)}(t)$ correspond to the hub and authority scores of HITS, respectively.
The HITS model violates the probabilistic interpretation and this makes the computation unstable, since the matrix $W'\,W$ governing the composed iteration has a principal eigenvalue much larger than $1$. Hence, unlike Google's PageRank, the HITS algorithm needs the scores to be normalized at the end of each iteration.
Finally, the HITS scheme suffers from other drawbacks. In particular, large highly connected communities of Web pages tend to attract the principal eigenvector of $W'\,W$, thus pushing to zero the relevance of all other pages. As a result, the page scores tend to decrease rapidly to zero for pages outside those communities. Recently some heuristics have been proposed to avoid this problem [4], even if such behavior cannot be generally avoided because of the properties of the dynamic system associated to the HITS algorithm [5,6].
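For comparison, a minimal HITS iteration with the per-step normalization discussed above is sketched below; the adjacency convention $W[p,q] = 1$ iff page $q$ links page $p$, and the toy graph, are assumptions of this note.

```python
import numpy as np

def hits(W: np.ndarray, iters: int = 100):
    n = W.shape[0]
    h = np.full(n, 1.0 / n)               # hub scores (surfer 1)
    a = np.full(n, 1.0 / n)               # authority scores (surfer 2)
    for _ in range(iters):
        h_next = W.T @ a                  # hub of q = sum of authorities of ch(q)
        a_next = W @ h                    # authority of p = sum of hubs of pa(p)
        h = h_next / h_next.sum()         # per-step normalization keeps the
        a = a_next / a_next.sum()         # unstable iteration bounded
    return h, a

W = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)   # toy graph: W[p, q] = 1 iff q -> p
h, a = hits(W)
```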
The PageRank and HITS schemes can also be combined into a pool of two surfers (PageRank-HITS). Like in HITS, the interaction between the surfers is described by the matrix

$$B = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},$$

but both surfers perform a uniform random jump with probability $1 - d$ and follow a back-link (surfer 1) or a hyperlink (surfer 2) with probability $d$. Further, we assume the independence of the parameters $x_1(p|q,b)$ and $x_2(p|q,l)$ on the page $p$. Hence, it holds that $\Gamma_1 = W'\,D_a^{-1}$, where $D_a$ is the diagonal matrix with element $(q,q)$ equal to the number $|pa(q)|$ of links entering page $q$, and $\Delta_2 = W\,D_h^{-1}$, where $D_h$ is the diagonal matrix with element $(q,q)$ equal to the hubness $h_q$. Then, equation (16) becomes:

$$\mathbf{x}^{(1)}(t+1) = \frac{1-d}{N}\,\mathbf{1} + d\,W'\,D_a^{-1}\,\mathbf{x}^{(2)}(t), \qquad \mathbf{x}^{(2)}(t+1) = \frac{1-d}{N}\,\mathbf{1} + d\,W\,D_h^{-1}\,\mathbf{x}^{(1)}(t).$$
In particular, setting $d = 1$ yields a normalized version of HITS, which has been proposed in [5,6].
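A sketch of the two-surfer PageRank-HITS update follows, with degree-normalized forward and backward matrices and a uniform jump; the adjacency convention and toy graph match the HITS sketch above, and setting $d = 1$ recovers the normalized-HITS behavior.

```python
import numpy as np

def pagerank_hits(W: np.ndarray, d: float = 0.85, iters: int = 200):
    n = W.shape[0]
    out_deg = np.maximum(W.sum(axis=0), 1)   # h_q, guarded against zeros
    in_deg = np.maximum(W.sum(axis=1), 1)    # |pa(q)|, guarded against zeros
    Delta = W / out_deg                      # x(p|q,l) = 1 / h_q
    Gamma = W.T / in_deg                     # x(p|q,b) = 1 / |pa(q)|
    h = np.full(n, 1.0 / n)
    a = np.full(n, 1.0 / n)
    for _ in range(iters):                   # simultaneous coupled update
        h, a = ((1 - d) / n + d * (Gamma @ a),
                (1 - d) / n + d * (Delta @ h))
    return h, a

links = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]   # toy q -> p pairs
W = np.zeros((4, 4))
for q, p in links:
    W[p, q] = 1.0
h, a = pagerank_hits(W)
```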
The focused (vertical) scoring systems exploit a topic classifier which assigns each page $p$ a relevance score $s(p)$ with respect to the topic of interest. If the surfer is located at page $q$ and the pages $c_1, \ldots, c_{h_q}$ linked by page $q$ are scored $s(c_1), \ldots, s(c_{h_q})$ by the topic classifier, the probability of the surfer to follow the $i$-th link is defined as

$$x(c_i|q,l) = \frac{s(c_i)}{\sum_{j=1}^{h_q} s(c_j)}.$$

Hence, the modified equation to compute the combined page scores using a PageRank-like scheme is

$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \in pa(p)} \frac{s(p)}{\sum_{c \in ch(q)} s(c)}\,x_q(t).$$
This scoring system removes the assumption of complete randomness of the underlying Web surfer. In this case, the surfer is aware of what he/she is searching for, and he/she trusts the classifier's suggestions, following each link with a probability proportional to the score of the page the link leads to. This allows us to derive a topic-specific page rank. For example, the ``Microsoft'' home page is highly authoritative according to the topic-generic PageRank, whereas it is not highly authoritative when searching for ``Perl'' language tutorials: even if that page gets many citations, most of them are scarcely related to the target topic and thus are not significantly considered in the computation.
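The focused link choice can be sketched as follows; the adjacency lists and the classifier scores $s(\cdot)$ are hypothetical inputs of ours, and the sink handling (pure jump) is a simplifying assumption.

```python
import numpy as np

def focused_pagerank(children, s, d=0.85, iters=100):
    n = len(children)
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        x_next = np.full(n, (1.0 - d) / n)
        for q, ch in enumerate(children):
            if not ch:
                x_next += d * x[q] / n       # treat sinks as pure jumpers
                continue
            total = sum(s[c] for c in ch)    # normalizer of the link scores
            for c in ch:
                w = s[c] / total if total > 0 else 1.0 / len(ch)
                x_next[c] += d * x[q] * w    # focused link choice
        x = x_next
    return x

children = [[1, 2], [2], [0, 3], [0]]        # hypothetical adjacency lists
s = np.array([0.9, 0.2, 0.7, 0.1])           # hypothetical classifier scores
scores = focused_pagerank(children, s)
```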
We can model this behavior by adapting the action probabilities using the contents of the current page, thus modeling a focused choice of the surfer's actions. In particular, the probability of following a hyperlink can be chosen to be proportional to the degree of relevance $s(q)$ of the current page $q$ with respect to the target topic, i.e.

$$x(l|q) = d \cdot s(q).$$

On the other hand, the probability of jumping away from a page decreases proportionally to $s(q)$, i.e.

$$x(j|q) = 1 - d \cdot s(q).$$
Such modifications can be integrated into the focused PageRank proposed in section 4.1 to model a focused navigation more accurately. Equation (12) guarantees that the resulting scoring system is stable and that it converges to a score distribution independent of the initial distribution.
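A sketch of the resulting double-focused iteration is given below; as above, the graph and the scores are toy inputs of ours, and the update combines the relevance-dependent action probabilities with the focused link choice of section 4.1.

```python
import numpy as np

def double_focused_pagerank(children, s, d=0.85, iters=100):
    n = len(children)
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        # each page q contributes a jump mass of (1 - d * s(q)) * x_q
        jump_mass = np.sum((1.0 - d * s) * x)
        x_next = np.full(n, jump_mass / n)
        for q, ch in enumerate(children):
            if not ch:                       # sinks jump with probability 1
                x_next += d * s[q] * x[q] / n
                continue
            total = sum(s[c] for c in ch)
            for c in ch:
                w = s[c] / total if total > 0 else 1.0 / len(ch)
                x_next[c] += d * s[q] * x[q] * w   # focused link choice
        x = x_next
    return x

children = [[1, 2], [2], [0, 3], [0]]        # hypothetical adjacency lists
s = np.array([0.9, 0.2, 0.7, 0.1])           # hypothetical classifier scores
scores = double_focused_pagerank(children, s)
```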
Using the focus crawler described in [11], we performed two focused crawling sessions, downloading 150,000 pages in each single crawl. During the first session the crawler spidered the Web searching for pages on the topic ``Linux''. During the second session the crawler gathered pages on ``cooking recipes''. Each downloaded page was classified to assess its relevance with respect to the specific topic. Considering the hyperlinks contained in each page, two Web subgraphs were created to perform the evaluation of the different WPSSs proposed in the previous sections. For the second crawling session, the connectivity map of pages was pruned by removing all links from a page to pages in the same site, in order to reduce the ``nepotism'' of Web pages.
The topological structure of the graphs and the scores of the text classifiers were used to evaluate the following WPSSs:
- the ``In-link'' surfer, which is located in page $p$ with probability proportional to the relevance of the pages linking $p$, i.e. $x_p \propto \sum_{q \in pa(p)} s(q)$, where $s(q)$ is the relevance of page $q$ computed by the text classifier;
- the PageRank surfer;
- the Focused PageRank scheme as described in section 4.1;
- the Double Focused PageRank scheme described in section 4.2;
- the HITS surfer pool;
- the PageRank-HITS surfer pool.
Figure 1 reports the plots for the two categories, ``Linux'' and ``cooking recipes''. In both cases the HITS surfer assigns a score value significantly greater than zero only to the small set of pages associated to the main eigenvector of the connectivity matrix of the analyzed portion of the Web. On the other hand, the PageRank surfer is more stable and its score distribution curve is smooth. This is the effect of the homogeneous term in equation (17) and of the stability in the computation provided by the probabilistic interpretation. The Focused PageRank surfer and the Double Focused one still provide a smooth distribution. However, the focused page ranks are more concentrated around the origin. This reflects the fact that the vertical WPSSs are able to discriminate the authorities on the specific topic, whereas the classical PageRank scheme considers the authoritative pages regardless of their topic.
Figures 2 and 3 show the 8 top-scoring pages for four different WPSSs on the database of pages on the topic ``Linux'', while figures 4 and 5 report the same results for the topic ``cooking recipes''.
As shown in figure 2, all pages distilled by the HITS algorithm come from the same site. In fact, a site which has many internal connections may attract the principal eigenvector of the connectivity matrix associated to the considered portion of the Web graph, conquering all the top positions in the page rank and hiding all other resources.
Even the elimination of intra-site links does not improve the performance of the HITS algorithm. For example, as shown in the HITS section of figure 4, the Web site ``www.allrecipe.com'', which is subdivided into a collection of Web sites (``www.seafoodrecipes.com'', ``www.cookierecipes.com'', etc.) strongly connected among them, occupies all the top positions in the ranking list, hiding all other resources. In [4,10] the content of pages is considered in order to propagate relevance scores only over the subset of links pointing to pages on a specific topic. In the ``cooking recipes'' case, the performance cannot be improved even using page content, since all the considered sites are effectively on the topic ``cooking recipes'', and thus there is a semantic reason why such sites are connected. We claim that such behavior is intrinsic to the HITS model.
The PageRank algorithm is not topic dependent. Since some pages are referred to by many Web pages independently from their content, such pages always turn out to be authoritative, regardless of the topic of interest. For example, in figures 2 and 4 pages like ``www.yahoo.com'', ``www.google.com'', etc., are shown in the top list even if they are not closely related to the specific topic. It is shown in figures 3 and 5 that the ``Focused PageRank'' WPSS described in section 4.1 can filter many off-topic authoritative pages from the top list. Finally, the ``Double Focused PageRank'' WPSS is even more effective in filtering all the off-topic authorities, while pushing all the authorities on the relevant topic to the top positions.
A quantitative evaluation of the WPSSs was performed using a methodology similar to the one proposed in [12]. For each WPSS we selected the pages with the highest scores, creating a collection of pages to be evaluated by a pool of human experts. The experts on the specific topics independently labelled each page in the collection as ``authoritative for the topic'' or ``not authoritative for the topic''. This reliable set of judgments was finally used to measure the percentage of positive (or negative) judgments on the best $N_t$ pages returned by each ranking function, for different values of $N_t$. The topics selected for these experiments were ``Linux'' and ``Golf''. As in the previous experiments, 150,000 pages were collected by focus crawling the Web.
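The judgment-based measure can be expressed compactly; `scores` and the boolean `judgments` below are hypothetical inputs to our sketch.

```python
import numpy as np

def positive_judgment_rate(scores: np.ndarray, judgments: np.ndarray,
                           n_top: int) -> float:
    """Fraction of the n_top best-ranked pages judged authoritative."""
    top = np.argsort(scores)[::-1][:n_top]   # indices of the n_top best pages
    return float(judgments[top].mean())
```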
Figure 6 reports the percentage of positive judgments on the best $N_t$ pages returned by the five WPSSs for the topics ``Linux'' and ``Golf'', respectively. In both cases the HITS algorithm is clearly the worst. Since its performance decreases significantly when applied to the entire collection of documents, it can only be used as a query-dependent ranking scheme [3].
We would like to thank Ottavio Calzone who performed some of the experimental evaluations of the scoring systems.
- M. Henzinger, ``Hyperlink analysis for the Web,'' IEEE Internet Computing, vol. 5, no. 1, pp. 45-50, January/February 2001.
- L. Page, S. Brin, R. Motwani, and T. Winograd, ``The PageRank citation ranking: Bringing order to the web,'' tech. rep., Computer Science Department, Stanford University, 1998.
- J. Kleinberg, ``Authoritative sources in a hyperlinked environment,'' IBM Research Report RJ 10076, May 1997.
- K. Bharat and M. Henzinger, ``Improved algorithms for topic distillation in a hyperlinked environment,'' in Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 104-111, 1998.
- R. Lempel and S. Moran, ``The stochastic approach for link-structure analysis (SALSA) and the TKC effect,'' in Proceedings of the 9th World Wide Web Conference, 2000.
- R. Lempel and S. Moran, ``SALSA: the stochastic approach for link-structure analysis,'' ACM Transactions on Information Systems, vol. 19, pp. 131-160, April 2001.
- D. Cohn and H. Chang, ``Learning to probabilistically identify authoritative documents,'' in Proc. 17th International Conf. on Machine Learning, pp. 167-174, Morgan Kaufmann, San Francisco, CA, 2000.
- D. Cohn and T. Hofmann, ``The missing link: a probabilistic model of document content and hypertext connectivity,'' in Neural Information Processing Systems, vol. 13, 2001.
- E. Seneta, Non-negative Matrices and Markov Chains. Springer-Verlag, New York, 1981.
- M. Joshi, V. Tawde, and S. Chakrabarti, ``Enhanced topic distillation using text, markup tags, and hyperlinks,'' in International ACM Conference on Research and Development in Information Retrieval (SIGIR), August 2001.
- M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, ``Focused crawling using context graphs,'' in Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt, pp. 527-534, 2000.
- B. Amento, L. Terveen, and W. Hill, ``Does authority mean quality? predicting expert quality ratings of Web documents,'' in Proceedings of the 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 296-303, 2000.