Adaptive Ranking of Web Pages
Ah Chung Tsoi
Office of Pro ViceChancellor(IT)
University of Wollongong
Wollongong, NSW 2522, Australia
Office of Pro ViceChancellor(IT)
University of Wollongong
Wollongong, NSW 2522, Australia
Gianni Morini
Dipartimento di Ingegneria dell'Informazione
Universita' degli studi di Siena
Siena, Italy
Dipartimento di Ingegneria dell'Informazione
Universita' degli studi di Siena
Siena, Italy
Franco Scarselli
Dipartimento di Ingegneria dell'Informazione
Universita' degli studi di Siena
Siena, Italy
Dipartimento di Ingegneria dell'Informazione
Universita' degli studi di Siena
Siena, Italy
Markus Hagenbuchner
Office of Pro ViceChancellor (IT)
University of Wollongong
Wollongong, NSW 2522, Australia
Office of Pro ViceChancellor (IT)
University of Wollongong
Wollongong, NSW 2522, Australia
Marco Maggini
Dipartimento di Ingegneria dell'Informazione
Universita' degli studi di Siena
Siena, Italy
Copyright
is held by the author/owner(s).
WWW2003, May 2024, 2003, Budapest, Hungary.
ACM 1581136803/03/0005.
Abstract:
In this paper, we consider the possibility of altering the PageRank of web pages, from an administrator's point of view, through the modification of the PageRank equation. It is shown that this problem can be solved using the traditional quadratic programming techniques. In addition, it is shown that the number of parameters can be reduced by clustering web pages together through simple clustering techniques. This problem can be formulated and solved using quadratic programming techniques. It is demonstrated experimentally on a relatively large web data set, viz., the WT10G, that it is possible to modify the PageRanks of the web pages through the proposed method using a set of linear constraints. It is also shown that the PageRank of other pages may be affected; and that the quality of the result depends on the clustering technique used. It is shown that our results compared well with those obtained by a HITS based method.Categories and Subject Descriptors
H.3.3 [Information storage and retrieval]: Information Search and Retrieval; H.5.4 [Information interfaces and presentation]: Hypertext/HypermediaGeneral Terms
Search Engine, Page rankKeywords
Adaptive PageRank determinations, Learning PageRank, quadratic programming applications1. Introduction
Google is probably one of the most popular internet search engines available today. A major feature of the Google search engine lies in its arrangement of the most ``relevant'' pages with respect to a user's inquiry [1]. This set of most ``relevant'' pages is obtained by what is known as the PageRank algorithm [2,3,4], in which the web pages are ranked according to the number of links which point to it. A page with many links pointing to it is highly authoritative. In [2], an algorithm was given to determine the PageRank of the web pages according to their link structures. It is shown in [2] that the PageRank of the set of web pages can be computed using the following recursive algorithm:where is an dimensional vector denoting the PageRank of the set of web pages. denotes the evolution of the PageRank vector at the th iteration. is a matrix with elements if there is a hyperlink from node to node , and is the total number of outlinks of node , and otherwise. is an dimensional vector with all elements equal to 1. is a damping factor. The PageRank of the set of web pages is given by the steady state solution of (1).
At the steady state, we have
Note that the PageRank of the set of web pages is determined once the link structure among the web pages is given, and fixed. In other words, the PageRank of a set of web pages can be determined uniquely once the link structure of the web pages is fixed, and is satisfied [1,3]. Note also that PageRank is only one among a number of factors which Google uses to arrange the final score^{}of a particular web page. There are various attempts in altering the PageRank of a set of web pages. There are two perspectives: (a) from the perspective of users, and (b) from the perspective of web site administrators. There is much discussion in the commercial literature on how to influence the PageRank of web pages from a user's point of view. This is in recognition of the economic gain which might be derived from having highly ranked web pages, as a human web surfer is mostly interested in the first few pages of returned universal resource locators (URLs) from the search engine query on a particular topic. There are various techniques deployed in the search engine optimization literature, mainly in changing the link structure of web pages which are under the control of the user. As indicated previously, the link structure is one of the factors which will determine the Google's score. Hence, by modifying the link topology, paying special attention to the way that the PageRank is computed, it is possible to raise the PageRank of selected pages under the control of the user. In addition, by exchanging links with other users, it is possible to raise the PageRank of selected pages further. From the web administrator's point of view, there are also reasons for raising or decreasing the PageRank of certain web pages. For example, if a user uses a ``link farm'' to artificially inflate their PageRanks, it would be useful if the web site administrator has a way to decrease the influence of the link farm mechanism in its determination of the PageRanks for other web pages. In the case that the web site is used as a special purpose portal, there may be situations in which the web site administrator would wish to increase the PageRank of web pages which are related to the topics carried by the portal. For example, the administrator of a portal on wine may wish that pages about wine have a higher rank than other pages which are unrelated to wine. The administrator of a search engine may need to decrease the rank of spamming pages; or the administrator of a site may wish that the energy^{}of his/her site is higher than the energy of a competitor. In this paper, we will consider possible modifications of PageRanks from a web administrator's point of view. It is assumed that the web administrator has no possibilities of modifying the topology of the link configuration of the web pages. The only mechanism which is opened to the web administrator is to modify the PageRank equation (1) by modifying the ``control'' variable in that equation. Consider the PageRank equation (1). It is simple to note that this equation can be written in more conventional notations as follows:
where is a dimensional vector, often called the state of the equation, is a dimensional vector, called the input vector. is constant matrix and is constant matrix. In the case of PageRank equation, , , the identity matrix, , and has all elements equal to . Thus, the issue here is how to design the control variable in such a way that the PageRanks of a set of selected web pages are altered. There are two ways in which we can modify the PageRank by manipulating the control variable :
 Dynamic control. In this case, the issue is to design a set of controls , such that the PageRanks is modified (when they reach the steady state).
 Static control. In this case, recognizing the fact that the PageRank is the steady state solution of (1), it might be possible to design in such a way that the PageRanks of selected web pages are modified.
1.1 The constraints
In the simplest case, the ranks of
selected pages are set to predefined target values. For
example, the administrator of a search engine may notice that a page is ranked lower
than desired and
may decide to increase its rank by . Thus, the
target for that page will be the original rank times .
Formally, let us assume that pages
have the targets
,
respectively, whereas the ranks of other pages are undefined. The constraint
can be written as follows:
where , the superscript denotes the transpose of a vector, or a matrix, and is a projection matrix such that . In other cases, one may wish to establish an ordering on the pages. For example, we can enforce by the inequality , where , and . Similarly, one can constrain the energy of the site to be larger than a threshold , by the inequality . More generally, a set of rules on the PageRanks can be formally defined as follows:
where and . These rules can constrain the ordering between two pages; they can enforce the ordering between the energy of two communities; they can establish a range of desired values for the rank of a page and so on.
1.1.1 Additional constraints
Most likely administrators have no a priori information concerning the range of PageRanks of vector ; therefore a common constraint will be to deal with the order of the pages, rather than the PageRanks. For example, if our set of pages consists of and we wish that the rank be more important than , it seems natural to writewhere is a coefficient chosen by the administrator. The following example shows that such a constraint does not necessarily ensure the desired result:
=  =  
=  = 
True  
False 
1.2 Cost
It is reasonable to adopt the following assumption: apart from the web pages whose PageRanks
need to be modified, for the rest of web pages in the web site, we do not wish to perturb their
current rank if possible.
Let
be the PageRank of a set of web pages as obtained from (2). Let
be the
modified PageRanks the same set of web pages when we apply the control. Then, it is reasonable
to expect that the following cost function can be imposed:
where denotes the norm of the set of vectors, . In this paper, we will only consider the case where .
1.3 Summary
The problem of modifying the PageRank can be posed as follows:where is an dimensional vector, denoting the set of control variables, subject to the constraints:
 ,
 .
2. First solution
Since the PageRank is obtained as the steady state solution of (1),
it is reasonable to infer that we will only be interested in the steady state solution
of (8). Towards this end, if we define
(10) 
which is the same as PageRank except for the vector in place of . We can substitute (11) in (9) to obtain
Now we can consider the cost function as follows:
(13) 
where the constant term can be omitted in the minimization of the function
Finally, we can use (14) as cost function in
Notice that (15) is a standard positive definite quadratic programming problem with an inequality constraint set which can be solved very efficiently [5]. The problem fits in the positive definite quadratic programming problem because is positive definite with an inequality constraint set. The solvability of this problem is given as follows:
 is positive definite. In this case, it is satisfied.
 , and , where the superscript denotes the optimal solution and denotes the reduced matrix which contains only rows in which the constraints are satisfied, assuming that there are only constraints which are active.
 , where is the set of Lagrange multipliers, and , .
3. Practical solution
For the world wide web, , the dimension of the PageRank equation, or the modified PageRank
equation can be in the region of billions. Hence it would not be possible to solve the quadratic
programming problem formulated in Section 2, apart from some very simple problems of which
is of low order.
In this section, we will introduce a practical method for solving the situation when is large.
The key is to group pages in the world wide web together into clusters, and thus reducing the
number of dimension of the state space which needs to be considered.
3.1 Reduce the complexity
Let us consider a partition
of web pages. In other words, we wish to partition
the total number of web pages into clusters. These clusters can be arranged
according to some criteria, e.g., approximately the same PageRank, approximately the same score.
For sake of simplicity, we further assume that
the pages are ordered so that the pages in cluster are
,
the pages in cluster are
and so on.
In our approach, the vector is defined by parameters (control variables),
one parameter for each cluster. The main reason why we use this
technique is that the number of control variables in the modified PageRank equation is
the same as the number of states. Hence, if the number of states is reduced through clustering from
to , then the corresponding number of control variables is reduced as well.
More precisely, we have
(16) 
where , , are the vectors obtained by summing all the columns that correspond to the class . We define further:
Thus, we can write which allows us to rewrite (12) as follows:
where . Moreover, note that the vectors fulfill
where is the vector where the th component is if th page belongs to th class and , otherwise. Thus, the vectors can be easily computed, using the same approach adopted for PageRank, by
Let us define the clustering matrix
We can use in the following equation:
Since , so we can write . Now we can focus on the cost function (14). We use (21) to write:
In the following, we use for the quadratic term and for the linear term in the cost function
Consequently, in order to find the optimal solution , we proceed to solve the quadratic programming problem^{}
Note that , where the number of constraints and the number of clusters is small, and .
4. Relaxed Constraints Problem
In the event that administrators define contradicting
constraints then (23) has no feasible solutions.
In order to compute a suboptimal solution, we introduce a method to regulate the strength of the constraints.
Instead of forcing the algorithm to fulfill the constraints, we add a new term to the cost function:
where the coefficient is used to balance the importance between constraints and the original cost function. We wish to express (24) in a standard quadratic programming formulation. We first focus on the second term in (24)
Notice that we can remove the constant term without affecting the solution. We can substitute in (24)
Let us define
Matrix is positive semidefinite as well and we can finally write (24) in the following manner:
This approach allows us to find a suboptimal solution for every constraint set. The parameter influences the resulting rank vector in the following manner:
4.1 Remarks
Our proposed method can be summarized as
follows:
 Step 1
 Use a clustering algorithm in order to split the pages of the Web into clusters
.
 Step 2
 Compute the by solving the related system defined by (19).
 Step 3
 If (9) has a feasible solution, solve the quadratic programming
problem (23) in order to compute the optimal set of parameters
 Step 4
 If (9) has no feasible solutions, solve the quadratic programming
problem (27) to compute a suboptimal set of parameters
.
 Step 5
 Compute the rank as
where are parameters that can be computed using the previously adopted reasoning. In this case, will be calculated by
5. Experiments
For the experiments conducted in this section we use a subset of the WT10G
data set, as distributed by CSIRO in Australia.
It pays special attention to the connectivity among the web pages. Hence this
set of web collection is suitable for evaluations of search engine algorithms based on
connectivity concepts, e.g., PageRank method.
Some of the properties of WT10G dataset are as follows:
 1,692,096 documents from 11,680 servers.
 171,740 interserver links (within dataset)
 9,977 servers with interserver inlinks (within dataset)
 8,999 servers with interserver outlinks (within dataset)
 1,295,841 documents with outlinks (within dataset)
 1,532,012 documents with inlinks (within dataset)
 Is the proposed algorithm effective in rearranging the pages as desired?
 How does the application of constraints on some pages affect the ranking of other pages in the collection?
 The effects of the number of clusters on the performance of the algorithm.
 The effect of clustering methods on the performance of the algorithm.
5.1 Constraints effect
For this initial set of experiments
we chose to cluster the pages into 15 equally sized clusters.
The clusters are formed by sorting the pages according to their PageRanks in
descending order as shown in Figure 1.
The highest ranked pages are assigned to the first cluster.
From the remaining set of pages we assign the next highest ranking pages to cluster2, and so on.
In order to simplify the evaluation, we consider only
one type of
constraints, viz., to swap the importance of two pages located at a distance apart.
Our aim is to find the effectiveness of the proposed algorithm, and to
understand the influence of a single constraint on the page order of the rest of the web collection.
Thus, we consider a page with absolute position
and page with absolute position
,
and impose a constraint as follows:




 The proposed algorithm is effective in modifying the PageRank as desired.
 It is observed that a constraint on highly ranked pages disturbs the PageRank of the rest of the web collection more significantly. For example, for , it is observed that the standard deviation of the PageRank is as compared to when .
 It is noted that the perturbations of the pages appear in blocks. We found that this is due to the clustering of the web pages. Pages belonging to a cluster are bound together by the same parameter. This finding implies that the quality of the result is influenced by the number of clusters used.
 It is observed that when swapping the positions of two pages, the effect on lower ranked pages is stronger than on higher ranked pages.
 The observation that the location of higher ranked pages are perturbed by applying constraints on lower ranked pages can be explained by the fact that lowly ranked pages can have parents which are highly ranked. Hence, if we perturb the rank of the lowly ranked pages, then this can have an effect on parent pages which might be ranked higher. In other words, if we perturb the PageRanks of web pages, then we will perturb their associated ancestors as well. The extent to which the perturbation manifests itself depends on the original rank of the pages affected.
5.2 Number of clusters
We investigate the influence of the number of clusters which were introduced to reduce the complexity. We conduct experiments that gradually increase the number of clusters used from 5 to a maximum of 100 clusters. The effect is evaluated by considering the cost function, and the standard deviation of the absolute position as a measure. The result is given by Figure 6. It is possible to reduce the level of complexity of the algorithm by using the idea of clustering.
 The number of clusters may be quite small in comparison with the total number of web pages in the collection.
5.3 Clustering techniques
A number of ways in which web pages can be clustered are considered in the following: (a) Clustering by score
 This simple method uses a classifier to assign a coefficient
of affinity about a specific topic to each page in the web graph.
We refer to this coefficient as the score for the page .
Given a fixed number of cluster , we compute the
score range
A page belongs to cluster if . An effect is that clusters will be of different size. The dimension of each cluster depends on the distribution of the score in the graph.  (b) Clustering by rank
 This method suggests to use the PageRank as computed in (2).
Given a fixed number of clusters we compute the
rank range
A page belongs to cluster if . The dimension of each cluster depends on the distribution of the rank in the web graph.  (c) Clustering by rank with fixed cluster dimensions
 The dimension of each cluster can be fixed to , where is the dimension of the web graph. This is done by ordering the pages of the graph according to the rank as computed in (2), and assign the first pages to the cluster , the second set of pages to the cluster and so on. This clustering method was used for the experiments shown earlier in this section.
 (d) Clustering by rank
 with variable cluster dimensions using a set regime The idea of this method is motivated by observations made on experimental findings made earlier. The idea is to treat highly ranked pages differently because they play a critical role. The size of the cluster is to be smaller for clusters that contain more relevant pages, while we can tolerate larger dimensions for clusters that contain relatively irrelevant pages. We define a coefficient and a multiplication factor . We order the pages of the web graph according to the rank as computed in (2) and we assign the first pages to the cluster , the second set of pages to the cluster and so on. For example, for and the resulting cluster dimensions will be
Std. deviation: 27797.1
Std. deviation: 24668.7
Std. deviation: 3619.31
 The clustering scheme has a significant influence on the quality of the result.
 To build clusters from page scores generally produces the worst performance in terms of the perturbation on the PageRanks.
 Clustering methods based on rank gives good results.
 The clustering by ranks using a variable dimension by considering the magnitude of the PageRank appears to be working best.
5.4 Applying constraints on pages from a web community
In this experiment, we investigate the effect of a constraint that involves pages belonging to the same site ^{}. The question which we like to answer is: are the changes in the absolute position of the web pages caused by the proposed algorithm primarily affect pages from within the same community? In other words, could the changes in the PageRanks mostly come from the same community. This is quite a reasonable hypothesis in that if we wish to swap two pages in the same community, then most of the changes in the PageRanks might come from the same community, and only to a lesser extent come from other web pages unrelated to the community. We carry out an experiment to evaluate this proposition. We chose a community ``Stanford University''. In our web collection of 150,000 web pages we found 105 web pages which are from the Stanford community. In order to minimize the effect of clustering, we use 50 clusters, where each cluster is of dimension 2000. The constraint is to swap the position of two pages from the Stanford community, one located at (absolute) position 4377, and the other located at absolute position 6246. The results of the experiment are shown in Figure 9.5.5 Comparisons with other methods
To the best of our knowledge, the only alternative approach to alter the ranking of web pages from an administrator's point of view has been suggested by Chang etal.[6]. The underlying idea is to extend Kleinberg's HITS algorithm [7] by altering the eigenvector such that a given target page is given more weight. This is performed by using a gradient descent method on the elements of the link matrix . Chang etal. [6] indicated that their algorithm not only increases the rank of a given page but also increases the rank of all pages that are similar to the given page. A set of experiments was conducted to allow a qualitative comparison of our proposed algorithm with that proposed in [6]. The first striking difference of the two algorithms is the computational complexity. Chang et al.'s algorithm relies on a recursively applied multiplication of link matrices which resulted in nonsparse matrices. As a consequence, it is unlikely that Chang et al.'s algorithm is able to process link matrices which arise from the live world wide web. In practice, we found that we were unable to execute experiments for on a dual Xeon2GHz system equipped with 4GB of memory due to memory requirements. In actual fact we had to reduce even further to allow experiments to be executed within a reasonable amount of time.
6. Conclusions
In this paper, we first consider the possibility of modifying the ranking of web pages belonging
to a web collection by allowing the design of a set of control variables. We formulate
the problem as a standard quadratic programming problem in minimizing a cost function which
is the deviation of the absolute position of the web pages after being processed
by the proposed algorithm, and that of the original ranking as given by Google's PageRank algorithm,
subject to a number of constraints. Then we carry out a set of experiments designed to
evaluate the validity and understand the behaviours of our proposed algorithm.
It is found that it is possible to find a solution to the given task. In addition,
it is found that the PageRanks of all web pages are affected when placing a constraint on some of the web pages.
The effect on the global PageRanks depends on the rank of the pages to which a constraint is applied,
and on the clustering method chosen. If these pages are located in the relatively high ranking range,
then the PageRanks of the web pages will be perturbed more violently. On the other hand, even if
constraint affected pages are located in the relatively low rank region, it is observed that
higher ranked pages may also be perturbed. This is explained by the fact that these highly ranked pages
are the ancestors of the lowly ranked pages. Hence by altering the PageRanks of the lowly
ranked pages, it is possible that the ancestors of these lowly ranked pages would need to be
perturbed as well. The effect of PageRank perturbation is minimized by choosing a sufficiently
large number of clusters, where the clusters are formed with respect to
the magnitude of the rank of pages. Also, moderate constraints placed on lower ranked
pages significantly reduce PageRank perturbations.
An issue which we have not addressed in this paper is the nonlinear relationship between
rank and position. This is best illustrated as follows:
Ideally we do not wish to perturb the order
of the web
pages induced by the function applied on the rank vector
computed in (2) (see example in Figure (11).
The function is highly nonlinear. Because this mapping function is highly nonlinear,
it could happen that while the constraints on the web pages are satisfied, the absolute position
of the resulting web pages could be worse off. For example, if we wish to swap the position
of two web pages located in position and , and . After we process this using the proposed algorithm,
the two pages are located in positions and respectively, with . It could happen that
. In other words, the overall positions of the two swapped pages are worse off
than before the modification. This seems to defeat the purpose of the modification,
in that we wish to improve the position of particular web pages, with respect to
others. Thus, it would be useful to have an algorithm which will preserve the absolute position of the
designated web pages.
Secondly, in our experiments we have only considered a single constraint, viz., to swap
the position of two designated web pages. We could have imposed more constraints.
However, the picture is more complex, as it would be difficult to draw conclusions on
observations on experimental results. It would be useful if there is a more systematic method
for finding out the effects of multiple constraints on the modification of PageRanks.
These tasks are presented as a challenge for future research.
7. References
 1
 Bianchini, M., Gori, M., Scarselli, F. ``Inside PageRank'', Tech. Report DII 1/2003, University of Siena, Italy, 2003.
 2
 Brin, S., Page, L. ``The anatomy of a large scale hypertextual web search engine''. Proceedings of the 7th WWW conference, April, 1998.
 3
 Ng, A.Y., Zheng, A.X., Jordan, M.I. ``Stable algorithms for link analysis'', in Proceedings of IJCAI2001, 2001.
 4
 Zhang, D., Dong, Y. ``An efficient algorithm to rank web resources'', in Proceedings of the 9th WWW Conference, Elsevier Science, 2000.
 5
 Gill, P., Murray, W., Wright, M., Practical Optimization. Academic Press, 1981.
 6
 Chang, H., Cohn, D., McCallum, A.K., ``Learning to Create Customized Authority Lists'', Proc. 17th International Conf. on Machine Learning, 2000.
 7
 Kleinberg, J., ``Authoritative sources in a hyperlinked environment'', Proc. 9th ACMSIAM Symposium on Discrete Algorithms, 1998.
About this document ...
Adaptive Ranking of Web Pages
Converted by Latex2Html. The command line arguments were:
latex2html nonavigation noaddress transparent white style http://www2003.org/www2003.css prefix 820tsoi split 0 820tsoi.tex
The translation was initiated by Markus Hagenbuchner on 20030405
Footnotes
 ... score^{}
 In this paper, we make a distinction between ``PageRank'' and ``score''. PageRank is the value which is obtained from the steady state solution of (1), while score is the final value which is obtained by taking into account a number of other factors, e.g., the anchor text, keywords, etc.
 ... energy^{}
 The sum of the PageRank of the web pages in the site [1] as a measure of the collective ``power'' of a set of web pages.
 ... wine^{}
 This is assuming that the clustering method used is related to the content of the pages. This would not be true for clustering methods based on other criteria, e.g., scores.
 ... problem^{}
 The standard quadratic programming formulation for the function cost is where is a symmetric matrix.
 ... site^{}
 A site is a collection of pages from the same domain (e.g. stanford.org).
 ... inlinks^{}
 The smaller the subset of webpages, the less connectivity between pages, causing more pages to be isolated.