An Efficient Algorithm to Rank Web Resources
Dell Zhang, Yisheng Dong
Department of Computer Science & Engineering, Southeast University, Nanjing, 210096, China
How to rank Web resources is critical to Web Resource Discovery (search engines). This paper not only points out the weaknesses of current approaches, but also presents an in-depth analysis of the multidimensionality and subjectivity of rank algorithms. From a dynamics viewpoint, this paper abstracts a user's Web surfing action as a Markov model; based on this model, we propose a new rank algorithm. The result of our rank algorithm, which synthesizes the relevance, authority, integrativity and novelty of each Web resource, can be computed efficiently not by iteration but by solving a group of linear equations.
Keywords: Web, resource discovery, search engine, rank algorithm
1. Introduction
The World Wide Web is rapidly emerging as an important medium for the dissemination of information related to a wide range of topics. There are about 300 million pages on the Web today, with about 1 million being added daily. According to most predictions, the majority of human information will be available on the Web in ten years. But it is widely believed that 99% of the information on the Web is of no interest to 99% of the people. Looking for something valuable in this tremendous amount of information is as difficult as looking for a needle in a haystack.
Searching for valuable information in the Web is called resource discovery (RD). The IETF-RD group argues that resource discovery should provide users with a consistent, organized view of information. In a typical RD procedure, the user submits a query Q, which is simply a list of keywords (with some additional operators), to the RD server (RDS); the RDS then returns a set of related Web page URLs: R1, R2, …, Rn. There are many search engines to support RD in the Web, such as Yahoo! (http://www.yahoo.com), AltaVista (http://altavista.digital.com), Excite (http://www.excite.com), Hotbot (http://www.hotbot.com), Infoseek (http://www.infoseek.com), etc. A search engine usually collects Web pages on the Internet through a robot (spider, crawler) program; these Web pages are then automatically scanned to build giant indices, so that the set of all Web pages containing the given keywords can be retrieved quickly.
RD in the Web is especially difficult due to the following five characteristics of the Web data source: (1) huge and ubiquitous; (2) mostly semistructured or unstructured; (3) diverse in quality; (4) dynamic; (5) distributed and autonomous. In particular, a topic of any breadth will typically match several thousand or even several million relevant Web pages. For instance, entering "data mining" into the search engine AltaVista returns over 50,000 Web pages. Yet a user will typically be willing to look at only a few of these pages.
The rank algorithm can then help a user to select the correct ones (those of most value to him or her) from this sea of Web resources. Given a Web resource r and a user's query q, the rank algorithm computes a score rank(r, q). The bigger rank(r, q) is, the more valuable r is to q, i.e., the more valuable for the user. In practice, RD in the Web can be viewed as fuzzy queries driven by the rank algorithm, not as SQL-style precise queries. Moreover, a good rank algorithm is very helpful for crawling the Web more efficiently. Meta-search engines also need the rank algorithm to synthesize the results from other search engines [1, 10]. All in all, we argue that the rank algorithm is the core technique of a search engine, and it is critical for improving the quality of RD in the Web.
This paper is organized as follows. Section 2 points out the weaknesses of current approaches to ranking Web resources. Section 3 presents an in-depth analysis of the multidimensionality and subjectivity of rank algorithms. Section 4 abstracts a user's Web surfing action as a Markov model and proposes a new rank algorithm based on this model. Section 5 concludes this paper and discusses future work.
2. State Of The Art
At present, most rank algorithms for Web resources use a similarity measure based on the vector-space model, which has been well studied in the Information Retrieval (IR) community. To compute the similarities, we can view each document as an n-dimensional vector <w1, …, wn>. The term wi in this vector represents the ith word in the vocabulary. If the ith word does not appear in the document, then wi is zero; if it does appear, wi is set to represent the significance of the word. One common way to compute the significance wi is TF×IDF, i.e., to multiply the number of times the ith word appears in the document (TF) by the inverse document frequency (IDF) of the ith word. The IDF factor is one divided by the number of times the word appears in the entire "collection," which in this case would be the entire Web. The IDF factor corresponds to the content discriminating power of a word: a term that appears rarely in documents (e.g., "algebra") has a high IDF, while a term that occurs in many documents (e.g., "the") has a low IDF. The similarity between query Q and document R can then be defined as the inner product of the Q and R vectors. Another option is to use the cosine similarity measure, which is the inner product of the normalized vectors. The wi terms can also take into account where in an HTML page the word appears; for instance, words appearing in the title may be given a higher weight than other words in the body. With the popularization of Web metadata standards such as RDF, it becomes feasible to take advantage of the metadata of Web resources, which can also improve the rank algorithm's accuracy.
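As an illustration, the TF×IDF weighting and cosine similarity described above can be sketched as follows (a minimal sketch over a made-up toy corpus; the tokenization and the simple 1/df form of IDF stated above are used here purely for illustration):

```python
import math

def tf_idf_vectors(docs):
    """Build TFxIDF vectors for a toy corpus; docs is a list of token lists."""
    vocab = sorted({w for d in docs for w in d})
    # document frequency: in how many documents each word appears
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        # wi = TF x IDF, with the simple IDF = 1 / df described above
        vectors.append([d.count(w) * (1.0 / df[w]) for w in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity: inner product of the normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query can then be ranked against the documents by embedding it as one more vector over the same vocabulary and comparing with cosine.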
But the rank algorithms derived from IR have serious limitations: they evaluate only the content and totally neglect the quality of Web resources. So these rank algorithms can easily be cheated: webmasters can make their sites highly ranked by inserting irrelevant but popular words (e.g., "Clinton", "sex") into important places (e.g., the page title) or into metadata. This phenomenon is called Search Engine Persuasion (SEP) or Web spamming [11, 14].
Recent research in this area concentrates on mining the linkage structure of Web resources to support RD in the Web [3, 5, 7, 11, 16]. A typical example of such rank algorithms is PageRank, which was proposed at Stanford University and has been applied in the famous search engine Google (http://www.google.com/). The PageRank metric, PR(P), recursively defines the importance of a page P to be the weighted sum of the back-links to it. Such a metric has been found to be very useful in ranking the results of user queries. More formally, if a page has no outgoing link, we assume that it has outgoing links to every single page. Consider a page P that is pointed at by pages T1, T2, …, Tn. Let C(Ti) be the number of links going out of page Ti. Also, let d be a damping factor. Then, the weighted back-link count of page P is given by

PR(P) = (1 − d) + d · (PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn))
This leads to one equation per Web page, with an equal number of unknowns. The equations can be solved iteratively, starting with all PR values equal to 1. At each step, the new PR(P) value is computed from the old PR(Ti) values (using the equation above), until the values converge. Through the famous Perron-Frobenius Theorem, we can see that this calculation corresponds to computing the principal eigenvector of the linkage matrix.
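The iterative computation just described can be sketched as follows (an illustrative toy implementation, not the production algorithm; the adjacency-list representation is an assumption made for the sketch):

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(P) = (1 - d) + d * sum(PR(Ti) / C(Ti)) until convergence.

    links maps each page to the list of pages it points to; a page with
    no outgoing links is treated as linking to every page, as stated above.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # start with all PR values equal to 1
    for _ in range(max_iter):
        new = {}
        for p in pages:
            total = 0.0
            for t in pages:
                out = links[t] if links[t] else pages   # dangling-page rule
                if p in out:
                    total += pr[t] / len(out)
            new[p] = (1 - d) + d * total
        done = all(abs(new[p] - pr[p]) < tol for p in pages)
        pr = new
        if done:
            break
    return pr
```

On a strongly connected toy graph this converges quickly; the next section shows why convergence can fail without the damping factor.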
Theorem 2.1 (Perron-Frobenius Theorem)
If an n-dimensional matrix A is positive, or non-negative and irreducible, then:
- the spectral radius of A, ρ(A), is also a latent root (eigenvalue) of A;
- there is a positive eigenvector of A corresponding to ρ(A);
- ρ(A) is a simple root of the characteristic polynomial of A, i.e., it has multiplicity one.
Although the rank algorithms based on linkage structure break through the limitation of traditional IR technology, they still have some shortcomings:
- Only the authority metric has been taken into account;
- The iterative computation results in bad performance;
- It is difficult to deal with the overflow or underflow problem during iteration;
- Because of the "rank sink problem", the iterative computation may not converge.
The last situation, the "rank sink problem", is illustrated in Fig. 1. Consider two web pages that point to each other but to no other page, and suppose some web page points to one of them. During iteration, this loop accumulates rank but never distributes any rank (since there are no outgoing edges). The loop forms a sort of trap called a "rank sink". This problem can be analyzed more formally as follows.
Fig. 1. The "rank sink problem".
Definition 2.1
For an n-dimensional matrix A = (aij), the directed graph G(A) whose nodes are v1, …, vn and which contains an edge (vi, vj) if and only if aij ≠ 0 is named the adjoint directed graph of A.
Theorem 2.2
An n-dimensional non-negative matrix A is irreducible if and only if the adjoint directed graph of A is strongly connected (there exists a path from u to v for any nodes u and v in the graph).
It is obvious that the "rank sink problem" makes the Perron-Frobenius Theorem no longer applicable, so the iterative computation method loses its foundation.
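The non-convergence caused by a rank sink can be seen numerically in a sketch like the following (a toy experiment: pure back-link counting, i.e., the damping factor removed, run on the three-page graph of Fig. 1; page names and values are made up):

```python
def backlink_iteration(links, steps):
    """Undamped iteration PR(P) = sum(PR(Ti) / C(Ti)), starting from all 1."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(steps):
        # each new value sums the rank flowing in along back-links only
        pr = {p: sum(pr[t] / len(links[t]) for t in pages if p in links[t])
              for p in pages}
    return pr

# A points into the two-page loop {B, C}, which has no outgoing edge:
sink = {'A': ['B'], 'B': ['C'], 'C': ['B']}
```

After one step A's rank drains to zero, and the rank trapped in the loop oscillates between B and C forever, so the iteration never converges.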
To sum up, there are still many weaknesses of current rank algorithms. Simply inheriting IR technology and merely mining the linkage structure are not enough.
3. Analysis of the Rank Algorithm
The rank function of Web resources can be formally defined as rank: R × Q → ℝ, where R represents the set of relevant Web resources and Q represents the set of users' queries. Without loss of generality, we can view R as the relevant Web pages found by the search engine. Given r ∈ R and q ∈ Q, the bigger rank(r, q) is, the more valuable r is to q, i.e., the more valuable for the user.
If the function rank satisfies

rank(r, q) ≥ 0 for every r ∈ R, and Σr∈R rank(r, q) = 1,

then it is called a normal rank function. Since all rank functions can be transformed into equivalent normal rank functions, we assume all rank functions are normal in the following discussion.
In our opinion, a rational rank algorithm for Web resources should be multidimensional; at least it should include the following metrics.
The relevance metric means the distance between the content of a Web resource r and a user's query q. It is also the metric used by most search engines. The normalized relevance function can be defined as rel: R × Q → [0, 1]. As stated in section 2, the method to calculate relevance can be derived from IR technology, based on TF, IDF, word weights, metadata, etc.
The authority metric means how many Web resources refer to the Web resource r. Moreover, the Web resources referred to by higher quality resources should be assigned higher authority.
The integrativity metric means how many Web resources are pointed to by the Web resource r. Moreover, the Web resources that point to higher quality resources should be assigned higher integrativity. A Web resource with high integrativity is just like a good "survey" or "review" style academic paper: it can lead users to valuable information.
The novelty metric means the degree to which the Web resource r is different from the others, i.e., provides novel information. Analysis of query logs has demonstrated that users are impatient, rarely examining more than the first page of results (usually displaying 7~12 URLs). Hence we hope that the top 30 Web resources will be very representative. Such Web resources should have few direct links between themselves, because they will act as roadmaps so that users can easily follow the links embedded in them to find other valuable resources.
The value of a Web resource r depends not only on the query q, but also on the user's nation, age, gender, career, culture, hobbies, etc. So there is no absolutely best rank function. But we can still evaluate a rank algorithm based on the user's reaction.
Assuming RANK represents the set of all possible rank functions, the user's satisfaction function can be defined as sat: Q × RANK → ℝ, where sat(q, rank) is proportional to the user's satisfaction with the rank function. Given the Web resources set R = {r1, r2, …, rn}, without loss of generality we suppose r1, r2, …, rn are decreasingly ordered by their value according to the user's judgement. Define the "reverse order number" of the Web resource ri under the rank function as the number of resources rj with j > i that the rank function places before ri.
Given the queries set Q' = {q1, q2, …, qm}, the average satisfaction of a rank function should be the mean of sat(qi, rank) over all the queries qi in Q'.
4. The Rank Algorithm Based On Markov Model
The rank algorithms for Web resources discussed above all take a static viewpoint, while this paper presents a user-centered rank algorithm from a dynamics viewpoint. In fact, surfing the Web can be viewed as a dynamic procedure in which a user jumps from one Web resource to another. We abstract this surfing procedure as a Markov chain.
Definition 4.1
Suppose {Xt, t = 0, 1, 2, …} is a series of random variables on a finite state space S = {s1, s2, …, sn}. If the state of Xt+1 depends only on Xt, and not on X0, X1, …, Xt−1, i.e., for any states and any positive integer t the equation P(Xt+1 = sj | X0 = si0, X1 = si1, …, Xt = si) = P(Xt+1 = sj | Xt = si) always holds, then {Xt} is named a finite Markov chain. pij(t), for short, represents the probability of transiting from the state si to sj at time t. If the transition probability is independent of t, i.e., pij(t) = pij for any i, j and any t, then this type of Markov chain is called homogeneous. P = (pij)n×n is the transition probability matrix of this homogeneous Markov chain, where pij ≥ 0 and Σj=1..n pij = 1 for i = 1, …, n.
For a query q, let R = {r1, r2, …, rn} denote the set of related Web resources found by the search engine. We can use R as the state space (one Web resource corresponds to one state). We then consider a virtual user surfing on the Web: at time t, he is browsing the Web resource ri with probability πi(t), and will jump to the Web resource rj with probability pij. In this way, the user's surfing action on the Web can be abstracted as a homogeneous Markov chain. Although this modeling is rather simple and intuitive, we believe that it grasps the spirit of the surfing procedure.
Definition 4.2
A finite Markov chain's distribution vector at time t is π(t) = (π1(t), π2(t), …, πn(t)), where πi(t) = P(Xt = si), πi(t) ≥ 0, and Σi=1..n πi(t) = 1.
Theorem 4.1
A homogeneous Markov chain's behavior is determined by its initial distribution vector π(0) and its transition probability matrix P: π(t) = π(0)·P^t.
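The computation π(t) = π(0)·P^t can be sketched with repeated vector-matrix products (a minimal sketch; the two-state matrix used in the usage example is made up for illustration):

```python
def step(dist, P):
    """One step of a homogeneous Markov chain: pi(t+1) = pi(t) * P."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution(pi0, P, t):
    """pi(t) = pi(0) * P^t, computed without forming the matrix power."""
    dist = list(pi0)
    for _ in range(t):
        dist = step(dist, P)
    return dist
```

For example, with P = [[0.9, 0.1], [0.5, 0.5]] and π(0) = (1, 0), two steps give π(2) = (0.86, 0.14).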
Suppose at time t the virtual user stays in state si, i.e., he is browsing the Web resource ri; then at the next time t+1, he may make one of the following choices:
- Continue browsing the Web resource ri;
- Click a hyperlink in ri and thus jump to a new Web resource;
- Press the "BACK" button in the browser and thus return to the previously browsed Web resource;
- Select another Web resource from the results returned by the search engine, R.
Facing each of the above choices, the virtual user's tendencies are weighted by the four parameters α, β, γ and δ respectively, combined with the relevance function rel defined in section 3; α, β, γ and δ are four constants for a specific user, which satisfy α + β + γ + δ = 1 and α, β, γ, δ > 0.
The linkage structure graph of the Web resources, G = (V, E), is a directed graph, where V = {v1, v2, …, vn} is the set of nodes (node vi corresponds to the resource ri in R), and E is the set of edges, with (vi, vj) ∈ E if and only if ri contains a hyperlink to rj.
The tendency matrix T = (tij)n×n for the set of related Web resources, R, combines the above four tendencies, weighted by α, β, γ and δ, according to the linkage structure graph G.
Note that the tendency matrix has synthesized the four metrics mentioned above (relevance, authority, integrativity and novelty).
After normalizing the tendency matrix, we get the transition probability matrix of the Markov model for the user's surfing procedure on the Web.
Theorem 4.2
The transition probability matrix for the set of related Web resources, R, is P = (pij)n×n with pij = tij / Σk=1..n tik.
Through Theorem 4.1 and Theorem 4.2, the probability distribution vector at any time t can be calculated easily. Now it is obvious that α, β, γ and δ actually reflect the relative importance, in the user's opinion, of the relevance, authority, integrativity and novelty metrics. So we also call α the relevance parameter, β the authority parameter, γ the integrativity parameter, and δ the novelty parameter.
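The normalization in Theorem 4.2 is a simple row normalization; a sketch (the numeric tendency values below are made up for illustration):

```python
def to_transition_matrix(T):
    """Row-normalize a tendency matrix: p_ij = t_ij / sum_k t_ik."""
    return [[t / sum(row) for t in row] for row in T]
```

Each row of the result is a probability distribution over the next Web resource to browse.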
Definition 4.3
A homogeneous Markov chain on the state space S is holomorphic if for any si, sj ∈ S there exists a positive integer k such that the chain can transit from si to sj with positive probability within k steps.
Definition 4.4
Suppose A = (aij) is an n-dimensional non-negative matrix:
- A is called stochastic when Σj=1..n aij = 1 for every i;
- A is called primitive when there exists a positive integer k such that A^k > 0 (that is, every element of A^k is positive).
It is obvious that multiplying several stochastic matrices produces a stochastic matrix, and every positive matrix is sure to be primitive.
Theorem 4.3
A homogeneous Markov chain on the state space S is holomorphic if and only if its transition probability matrix is a primitive stochastic matrix.
Based on Theorem 4.2 and Theorem 4.3, it is easy to see that the Markov chain corresponding to the user's surfing procedure is a holomorphic Markov chain: since δ > 0, every element of its transition probability matrix is positive, so the matrix is both stochastic and primitive.
Theorem 4.4
A holomorphic and homogeneous Markov chain {Xt}, with S as its state space, P as its transition probability matrix, and π(0) as its initial distribution vector, will converge to a unique ultimate distribution π as t → ∞, that is, lim t→∞ π(t) = π. The ultimate (stable) distribution vector π = (π1, π2, …, πn) is the unique solution of
π·P = π
that satisfies πi ≥ 0, Σi=1..n πi = 1.
Assuming the virtual user is rational and experienced enough, we argue that the probability of browsing the Web resource ri should be proportional to its worthiness. In practice, we can also observe that a user usually spends more time on the Web resources he cares about more. At first, the user has no knowledge about the value of each Web resource, so he selects Web resources blindly or randomly. The user becomes more and more experienced as he browses more and more Web resources, so his judgement of the value of resources becomes more and more accurate. As time goes to infinity, the ultimate probability of browsing each Web resource should reflect the worthiness of that Web resource accurately. This is our main idea.
From Theorem 4.4, we know that the ultimate distribution vector π of a holomorphic Markov chain is independent of the initial distribution vector π(0) and totally determined by the transition probability matrix P. So, given a set of related Web resources R, we can construct the transition probability matrix through Theorem 4.2, then calculate the ultimate distribution vector based on Theorem 4.4. The ultimate distribution vector is just the rank of Web resources we are looking for. The group of equations in Theorem 4.4 can be solved by Gaussian elimination without any iteration. Some mathematical software packages (such as MATLAB) already provide such capability. Our initial implementation shows that the rank algorithm is efficient and scalable in practice.
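As a concrete sketch, the linear system of Theorem 4.4 can be solved directly, with the normalization constraint replacing one redundant equation (a minimal Gaussian-elimination implementation for illustration; in practice a library solver such as MATLAB's would be used):

```python
def stationary_distribution(P):
    """Solve pi * P = pi, sum(pi) = 1 directly, without iteration.

    Builds the singular system (P^T - I) pi = 0, replaces its last
    (redundant) equation with the normalization sum(pi) = 1, then runs
    Gaussian elimination with partial pivoting.
    """
    n = len(P)
    # A[i][j] holds the coefficient of pi_j in equation i: P[j][i] - delta_ij
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    A[n - 1] = [1.0] * n          # normalization: pi_1 + ... + pi_n = 1
    b[n - 1] = 1.0
    # forward elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x
```

For example, for the two-state chain P = [[0.9, 0.1], [0.5, 0.5]] this yields π = (5/6, 1/6), the distribution that repeated multiplication by P would only approach in the limit.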
The parameters in this rank algorithm (α, β, γ, δ) will differ from user to user, and can be adjusted for particular needs. These parameters can also be automatically estimated based on the user's surfing history, which is discussed in another paper of ours. In our experiments, the parameters are initially set to default values.
5. Conclusion
From a dynamics viewpoint, this paper provides a rank algorithm for Web resources based on a Markov model. The advantages of this rank algorithm are:
- Several metrics (relevance, authority, integrativity and novelty) have been synthesized;
- The result can be calculated efficiently through solving a group of linear equations, without any iteration;
- The parameters can be customized and dynamically adjusted by the user.
Several researchers have pointed out that there is plenty of information buried in the user's bookmarks and history logs. We are now investigating how to leverage this information in our rank algorithm. Classifying and clustering Web resources automatically are also very important to RD in the Web [13, 17]. We believe that the Markov model proposed in this paper can also be applied there.
The Web can be viewed as a "complex system", and nonlinear dynamics (chaos, fractals, and so on) should be very helpful for managing and making use of the Web. We believe the World Wide Web (WWW) will eventually become an information retrieval tool for whoever, whenever, wherever (WWW) you are.
References
[1] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public Web search engines. Computer Networks and ISDN Systems, 30(1998) 379-388.
[2] P. Bernstein, M. Brodie, S. Ceri, et al. The Asilomar Report on Database Research. Technical Report MSR-TR-98-57, Microsoft Research, Microsoft Corporation, September 1998.
[3] S. Brin, L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1998) 107-117.
[4] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins. Hypersearching the Web. Scientific American, June 1999.
[5] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks and ISDN Systems, 30(1998) 65-74.
[6] Y. Chen. Web As A Data Source. Ph.D. Thesis, Department of Computer Science & Engineering, Southeast University, Nanjing, China, 1999.
[7] J. Cho, H. Garcia-Molina, L. Page. Efficient crawling through URL ordering. In Proc. of the Seventh International World Wide Web Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
[8] M. Iosifescu. Finite Markov Processes and Their Applications. Wiley, Chichester, 1980.
[9] Q.Y. Jiang. Mathematical Model (2nd edition). Higher Education Press, Beijing, China, 1993.
[10] S. Lawrence and C.L. Giles. Inquirus, the NECI meta search engine. Computer Networks and ISDN Systems, 30(1998) 95-105.
[11] M. Marchiori. The quest for correct information on the Web: hyper search engines. Computer Networks and ISDN Systems, 29(1997) 1225-1235.
[12] M. Marchiori. The limits of Web metadata, and beyond. Computer Networks and ISDN Systems, 30(1998) 1-9.
[13] S.A. Macskassy, A. Banerjee, B.D. Davison, et al. Human performance on clustering Web pages: a preliminary study. In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, August 1998.
[14] G. Pringle, L. Allison and D.L. Dowe. What is a tall poppy among Web pages? Computer Networks and ISDN Systems, 30(1998) 369-377.
[15] C.M. Bowman. Scalable Internet resource discovery: research problems and approaches. Comm. of the ACM, 1994, 37(8):98-107.
[16] E. Spertus. ParaSite: mining structural information on the Web. Computer Networks and ISDN Systems, 29(1997) 1205-1215.
[17] M.R. Wulfekuhler and W.F. Punch. Finding salient features for personal Web page categories. Computer Networks and ISDN Systems, 29(1997) 1147-1156.
[18] D.Y. Zhu. Mathematical Modeling Cases. Southeast University Press, Nanjing, China, August 1999.
Dell Zhang was born in July 1976 in Yangzhou, China. He received his bachelor's degree in Computer Science from Southeast University, China, and is now a Ph.D. candidate in Computer Science there. He is a member of ACM and IEEE Computer Society. His main research interests are data mining, information retrieval, and evolutionary computing.
Yisheng Dong is a full professor of Computer Science, and also the dean of the Department of Computer Science & Engineering at Southeast University, China. He graduated in 1965 from Nanjing Institute of Technology, China. His research domain includes databases, software engineering and information systems.