A Subjective Measure of Distance on the WWW
Yutaka Matsuo
University of Tokyo & Japan Science and Technology Corporation
Hongo 731, Bunkyoku, Tokyo 1138656, Japan
+81358417773
matsuo@miv.t.utokyo.ac.jp
Yukio Ohsawa
University of Tsukuba & Japan Science and Technology Corporation
Otsuka 3291, Bunkyoku, Tokyo 1130012, Japan
+81339427141
osawa@gssm.otsuka.tsukuba.ac.jp
Mitsuru Ishizuka
University of Tokyo
Hongo 731, Bunkyoku, Tokyo 1138656, Japan
+81358416755
ishizuka@miv.t.utokyo.ac.jp
ABSTRACT
In this paper, we propose a new definition of the distance between two pages, called averageclicks. It is based on the probability to click a link through random surfing. We compare the averageclicks measure to the classical measure of clicks, and show that averageclicks fits better to the users' subjective distance.
Keywords
averageclicks, link structure, random surfer model, subjective distance
1. INTRODUCTION
There have been a number of recent activities on how to exploit the link structure of the Web. Most of these works analyzing the structure of the Web graph assume the length of each link to be 1 (unit), and the clicks between two pages are counted to measure the distance. However, the distance measured by the number of clicks doesn't reflect well the users' subjective distance. For users, it requires a great effort to find and click a link among a large number of links than a link among a couple of links. If we count minimal clicks to measure the distance between two pages, the path is likely to include link collections, such as Yahoo! directories [1].
In this paper, we propose a new definition of the distance between two pages, called averageclicks instead of the classical "clicks" measure. This measure reflects how many "average clicks" are needed from a page to another page. An averageclick is one click among n links. And two averageclicks is a distance of two successive clicks among n links for each, or one click among n^{2} links. (We employ n=7.) The averageclick is defined on the probability for a "random surfer" to reach the page: A random surfer keeps clicking on successive links at random. The probability for a random surfer in page p to click one of the links in page p is considered as 1/OutDegree(p) in this model. We annotate the link in page p with the length of log_{n} (1/OutDegree(p)). An averageclick is a unit distance of this measure.
2 AVERAGECLICKS
By a random surfer model, the probability to click each link in page p is α/OutDegree(p), where α is a damping factor and OutDegree(p) is the number of links page p has. In probability 1α, a random surfer jumps to a random Web page. Following [2] α is usually set to be 0.85, however, we set α=1 for simplicity.
We annotate a link with the length as negative logarithm of probability, so that summing lengths is akin to multiplying probabilities. A length of a link in page p is defined as
The distance between two pages p and q is defined by the shortest path, that is, the sum of the length of the shortest path from p to q. From a probabilistic point of view, this is equivalent to focus only on the path with the largest probability for a random surfer to get from page p to page q.
Fig. 1: Averageclicks and clicks.
Fig. 1 illustrates some pages and the links between them. Page A has three links, thus the length of each link is log_{7} (1/3) ≈ 0.56 averageclick. As page B has seven links, the length is each 1 averageclick. Summing 0.56 and 1, the distance from A to C is 1.56 averageclick. In the case of page D, there is two paths from page A to D. The averageclicks is smaller in the lower path, though it takes three clicks. The shortest path in terms of averageclicks is the lower path, while the path with minimal clicks is the upper path.
This model offers a good approximation to our intuitive concept of distance between Web pages. For example, Yahoo! toppage has currently more than 180 links. In our definition, the length from the toppage to each subpage is far, as the upper path in Fig. 1. On the other hand, the path length by the local relation, such as the link to one's friends or the link to one's interests, is estimated rather short, as in the lower path of the figure. Intuitively we think the path through the Yahoo! toppage is longer than the path along the acquaintance chain with the same clicks. In our model, page C is more distant from page A than page D, and this fits well to our intuition.
3. EVALUATION
We implement the bestfirst algorithm to search the shortest path from a starting URL to a target URL. This bestfirst algorithm is similar to the breadthfirst algorithm for clicks measure.
The relation of averageclicks and users' subjective distance is evaluated by questionnaires. We asked 13 participants to rank the pages according to their perceived familiarity. First we pick up 30 pages randomly within a few clicks from each participant's homepage. Then, we asked him/her to answer how familiar each URL of the page is, without providing the contents of the pages or any distance measures. Answers to the questions were made on a 5point Likertscale from 1 (very familiar) to 5 (very distant). After the questionnaires, we compare users' ratings with the distance by clicks/averageclicks.
Fig. 2 and 3 show the scatter plot of the results by participant 1. We can see clearly that the rating is correlated with the averageclicks measure. The correlation coefficients of 13 participants are shown in Table 4. Averageclicks have generally stronger correlation with the users' rating. In all the case, the correlation of clicks exceeds the correlation of averageclicks.
Table1: The correlation coefficient of participants' rating and clicks/averageclicks.
Participant  Correlation coefficient  Significance level of difference 


Clicks  Averageclicks  
1  0.524  0.836  1% 
2  0.325  0.804  1% 
3  0.471  0.685  5% 
4  0.268  0.572  5% 
5  0.641  0.674  above 5% 
6  0.696  0.715  above 5% 
7  0.517  0.699  above 5% 
8  0.490  0.569  above 5% 
9  0.499  0.528  above 5% 
10  0.435  0.512  above 5% 
11  0.358  0.436  above 5% 
12  0.116  0.400  above 5% 
13  0.302  0.367  above 5% 
Average  0.434  0.600   
Fig 2: Scatter plot of answers and averageclicks by participant 1.
Fig 3: Scatter plot of answers and clicks by participant 1.
Table2: Average correlation coefficient to understandability, informativeness and interestingness.
Clicks  Averageclicks  

Understandability  0.316  0.434 
Informativeness  0.283  0.453 
Interestingness  0.148  0.159 
We made another questionnaire. Showing the participant the whole contents of the page for each 30 URL, we asked the following three questions:
 Understandability; "How easy is this page to understand?" from 1 (very difficult) to 5 (very easy),
 Informativeness; "How much does this page include what you don't know?" from 1 (very little) to 5 (very much),
 Interestingness; "How interesting is this page?" from 1 (very dull) to 5 (very interesting).
Table 2 shows the average correlation coefficient of participants. Understandability has a negative correlation to the distance, that is, if the page is further, it gets more difficult to understand the contents. Informativeness has a positive correlation to the distance, and interestingness has almost nothing to do with distance. Understandability and informativeness have stronger correlation with the averageclicks than clicks measure. Averageclicks approximates the distance between the context of the participant homepage and the other page.
4. DISCUSSION and CONCLUSION
The Web is an example of a social network. Our research also reveals a part of a social network. In [4], the weight of a link is defined by referring to the text of the page. The averageclicks measure requires only the number of links.
The classical clicks measure is intuitively understandable for all Internet users, while the distance based on the probability is relatively difficult to understand. That's why we bring semantics by setting the base of the logarithm to the average number of links in a page: The distance shows how many "average clicks" are needed from one page to another page. This type of length (or cost) assignment is common in the context of costbased abduction, where finding the MAP (maximum a posteriori probability) solution is equivalent to finding the minimal cost explanation for a set of facts [5]. Fetching the neighboring pages is a common procedure in many algorithms. Averageclicks measure provides a good justification of the practical search, such as "if there are few links, fetch the pages, but if there are many links, give up."
Our experiment is of course not conclusive, however we are convinced that at least in some cases we should use averageclicks instead of clicks measure.
6. REFERENCES
 Yahoo! http://www.yahoo.com
 L.Page, S.Brin, R.Motwani, and T.Winograd. The PageRank citation ranking: Bringing order to the web.Technical report, Stanford University, 1998.
 K.Bharat and A.Broder. A technique for measuring the relative size and overlap of public Web search engines. In Proc. 7th WWW Conf., 1998.
 S.Chakrabarti, B.Dom, P.Raghavan, S.Rajagopalan, D.Gibson, and J.Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. 7th WWW Conf., 1998.
 E.Charniak and S.E.Shimony. Costbased abduction and MAP explanation. Artificial Intelligence 66:345 374,1994.