The Structure of Broad Topics on the Web
Soumen Chakrabarti^{1}  Mukul M. Joshi  Kunal Punera  David M. Pennock  
IIT Bombay  IIT Bombay  IIT Bombay  NEC Research Institute 
Copyright is held by the author/owner(s).
WWW2002, May 711, 2002, Honolulu, Hawaii, USA.
ACM 1581134495/02/0005.
Abstract
The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of contentbased clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of communityspecific crawlers and linkbased ranking systems.
Categories and subject descriptors:
H.5.4 [Information interfaces and presentation]:
Hypertext/hypermedia;
H.5.3 [Information interfaces and presentation]:
Group and Organization Interfaces, Theory and models;
H.1.0 [Information systems]: Models and principles.
General terms: Measurements, experimentation.
Keywords: Social network analysis, Web bibliometry.
(Note: The HTML version of this paper is best viewed using Microsoft Internet Explorer. To view the HTML version using Netscape, add the following line to your ~/.Xdefaults or ~/.Xresources file:
Netscape*documentFonts.charset*adobefontspecific: iso88591
For printing use the PDF version, as browsers may not print the mathematics properly.)
1 Introduction
The Web is an actively evolving social network which brings together myriad topics into a uniform hypertextual medium. Superficially, it looks like a giant graph where nodes are hypertext pages and edges are hyperlinks. A closer look reveals fascinating detail: the textual matter, tag structure, site identity and organization, and (link) affinity to prominent topic directories such as Yahoo! and the Open Directory (DMoz), to name just a few. The best strategies for browsing, searching and foraging for Web information are clearly predicated on our understanding of the social processes which shape the Web. Naturally, the structure and evolution of the Web has been under intensive measurements and modeling in recent years.
1.1 Graphtheoretic models and measurements
With a few notable exceptions, most studies conducted on the Web have focused on its graphtheoretic aspects.
Barabási and Albert [3] proposed a local model for social network evolution based on preferential attachment: nodes with large degree are proportionately more likely to become incident to new links. They applied it to the Web graph, and the model predicted the powerlaw degree distribution quite accurately, except for underestimating the density of lowdegree nodes. This discrepancy was later removed by Pennock et al. [30] by using a linear combination of preferential and random attachment.
Random graph models materialize edges independently with a fixed probability. If the Web were a random graph, large densely connected components and bipartite cores would be extremely unlikely. The Web is not random, and such subgraphs abound on the Web. A small bipartite core is often an indicator of an emerging topic. Kumar et al. [21] mine tens of thousands of such bipartite cores and empirically observed that a large fraction are in fact topically coherent, but the definition of a `topic' was left informal.
Dense bipartite subgraphs are an important indicator of community formation, but they may not be the only one. Flake et al. [17] propose a general definition of a community as a subgraph whose internal link density exceeds the density of connection to nodes outside it by some margin. They use this definition to drive a crawler, starting from exemplary members of a community, and verify that a coherent community graph is collected.
Bröder et al. [7] exposed the largescale structure of the Web graph as having a central, strongly connected core (SCC); a subgraph with directed paths leading into the SCC, a component leading away from the SCC, and relatively isolated tendrils attached to one of the three large subgraphs. These four regions were each about a fourth the size of the Web, which led the authors to call this the ``bowtie'' model of the Web. They also measured interesting properties like the average path lengths between connected nodes and the distribution of in and outdegree. Followup work by Dill et al. [15] showed that subgraphs selected from the Web as per specific criteria (domain restriction, occurrence of keyword, etc.) also appear to often be bowtielike, although the ratio of component sizes vary somewhat. Contentbased criteria used keywords, not topics, and the interaction between topics, or the radius of topical clusters, were not addressed.
1.2 Contentbased locality measurements
A few studies have concentrated on textual content.
Davison pioneered a study [13] over about 100,000 pages sampled from the repository of a research search engine called DISCOWEB. He collected the following kinds of pairs of Web pages:
 Random:
 Two different pages were sampled uniformly at random (uar) from the collection.
 Sibling:
 One page x was picked uar from the collection, and then two distinct random outlinks u and v were selected from among the outlinks of x.
 SameDomain:
 A page u was sampled uar from the collection. A random outlink v was chosen such that u and v were from the same host (identified by name).
 DiffDomain:
 A page u was sampled uar from the collection. A random outlink v was chosen such that u and v were from different hosts (identified by name).
A single starting page u_{0} may be a noisy indicator of semantic similarity with pages later on in the walk, because it may have a limited vocabulary. The pages on a short walk (u_{0},u_{1},,u_{i}) may all be clearly about a given broad topic c, but each page may use a negligible fraction of the vocabulary of c. Thus, it is possible that u_{0}·u_{i} is small, whereas _{c}Pr(c)Pr(u_{0}c)Pr(u_{i}c) (the probability that they are both relevant to the same topic, see §2.2) is quite large.
It is also more informative to estimate the distance at which topical memory is lost, rather than only measure the rate of relevance decay for some fixed starting topics, because there is no reference for how dissimilar u_{0} and u_{i} need to be before we agree that memory of u_{0} has been lost.
We also go beyond the studies of Davison and Menczer in that we analyze the relation between different topics rather than only model similarity between pages in a small neighborhood.
1.3 Our contributions
We bring together two existing experimental techniques to launch a thorough study of topicbased properties of the Web: the ability to classify a Web page into predefined topics using a highspeed automatic classifier, and the ability to draw nearuniform samples from the Web graph using random walks. We use a 482class topic taxonomy from DMoz (http://dmoz.org/) and a sampling and classifying technique that we will describe in §2. By obtaining evidence that our samples are faithful, we avoid processing large Web crawls, although even our sampling experiments have fetched almost 16 million pages.
Our study reveals the following properties about communities of broad topics in the Web graph.
Convergence of topic distribution on undirected random walks: Algorithms for sampling Web pages uar have been evaluated on structural properties such as degree distributions [2,32]. Extending these techniques, we design a certain undirected random walk (i.e., assuming hyperlinks are bidirectional) to estimate the distribution of Web pages w.r.t. the Dmoz topics (§3). We start from drastically different topics, and as we strike out longer and longer random paths, the topic distribution of pages collected from these paths start to converge (§3.1). This gives us strong circumstantial evidence that the Web has a welldefined background topic distribution, even though we cannot directly measure the fidelity of our distribution w.r.t. the `whole' Web.
Degree distribution restricted to topics: Once we have some confidence in collecting samples which are faithful to the Web's topic distribution, we can measure the degree distribution of pages belonging to each topicbased community. We observe (§3.4) that the degree distribution for many topics follow the power law, just like the global distribution reported by Bröder and others. We offer a heuristic explanation for this observation.
How topicbiased are breadthfirst crawls? Several production crawlers follow an approximate breadthfirst strategy. A breadthfirst crawler was used to build the Connectivity Server [5,7] at Alta Vista. Najork and Weiner [27] report that a breadthfirst crawl visits pages with high PageRank early, a valuable property for a search engine. A crawl of over 80 million pages at the NEC Research Institute broadly follows a breadthfirst policy. However, can we be sure that a breadthfirst crawl is faithful to the Web's topic distribution? We observe that any bias in the seed URLs persists up to a few links away, but the fidelity does get slowly better (§3.2).
Representation of topics in Web directories: Although we use Web directories to derive our system of topics, they need not have fair representation of topics on the Web. By studying the difference between the directory and background distributions (§3.3), we can spot Web topics and communities that have an unexpectedly low (or high) representation on the directory, and try to understand why.
Topic convergence on directed walks: We also study (§4) page samples collected from ordinary random walks that only follow hyperlinks in the forward direction [18]. We discover that these ordinary walks do not lose the starting topic memory as quickly as undirected walks, and they do not approach the background distribution either. Different communities lose the topic memory at different rates. These phenomena give us valuable insight into the success of focused crawlers [11,14,31] and the effect of topical clusters on Google's PageRank algorithm [6,28].
Linkbased vs. contentbased Web communities: We extend the above measurements to construct a topic citation matrix in which entry (i,j) represents the probability that a page about topic i cites a page about topic j (§5). The topic citation matrix has many uses that we will discuss later. Typically, if the topics are chosen well and if there is topical clustering on the Web, we expect to see heavy diagonals in the matrix and small offdiagonal probabilities. Prominent offdiagonal entries signify human judgment that two topics adjudged different by the taxonomy builders are found to have connections endorsed by Web content writers. Such observations may guide the taxonomy builders to improve the structure of their taxonomies.
1.4 Limitations
A fixed taxonomy with coarsegrained topics has no hope of capturing all possible information needs of Web users, and new topics emerge on the Web all the time. Another concern is the potentially low accuracy of a stateoftheart text classifier on even such broad topics. Despite these limitations, we believe that data collected using a fixed classification system and an imperfect classifier can still be valuable.
We argue that new topics and communities, while numerous, are almost always extensions and specializations of existing broad classes. Topics at and near the top levels of Yahoo! and Dmoz change very rarely, if at all.
As long as the automatic classifier gives us a significant boost in accuracy beyond random choice, we need not be overly concerned about the absolute accuracy (although larger accuracy is obviously better). By using a heldout data set, we can estimate which topics are frequently confused by our classifier, and suitably discount such errors when interpreting collected data.
Recent hypertext classification algorithms [8] may reduce the error by using hyperlink features, but this may produce misleading numbers: first the classifier uses link structure to estimate classes, then we correlate the class with link structurethe results are likely to show artificial coupling between text and link features.
A classifier also acts as a dimensionality reduction device and makes the collected data more dense. Since we use a few hundred classes but Web documents collectively use millions of term features, estimating class distributions is easier than estimating term distributions.
2 Building blocks
2.1 Sampling Web pages
Sampling approximately uniformly at random (uar) is a key enabling technology that we use in this work, and we review sampling techniques in some detail here. The Web graph is practically infinite, because URLs can embed CGI queries and servers can map staticlooking URLs to dynamic content. We can discuss sample quality only in the context of a finite, static graph. A small set of strict rules often suffices to make the Web graph effectively finite, even if unspeakably large. E.g.,
 URLs with substrings in the following list are disallowed: ?, cgibin, &
 URLs with more than some maximum number of path components (counted by slashes) are disallowed
 URLs are permitted to have some maximum number of characters.
PageRankbased random walk: Henzinger et al. [18] proposed an early approach to this problem using PageRank [6,28]. The PageRank prestige p(u) of a node u in G is defined as the relative rate of returning to node u during an infinite random walk on G, where each step from a node v is taken as follows: with probability d (0 < d < 1) we jump to a random node in G, and with probability 1d we follow an outlink from v uar. The steady state distribution p of this walk is called the PageRank. Henzinger et al. first performed a PageRankstyle walk for some steps, and then corrected the bias by sampling the visited nodes with probability inversely proportion to their p scores. Rusmevichientong et al. [32] have enhanced this algorithm and proved that uniform sampling is achieved asymptotically.
The BarYossef random walk: An alternative to the biased walk followed by the correction is to modify the graph so that the walk itself becomes unbiased. BarYossef et al. [2] achieve this by turning the Web graph into an undirected, regular^{2} graph, for which the PageRank vector is known to have identical values for all nodes. The links are made undirected by using the link:... backlink query facility given by search engines. This strategy parasites on other people having crawled large sections of the Web to find the backlinks, and is not guaranteed to find all backlinks. This will introduce some initial bias in the sample towards pages close to the starting point of the walk. Unfortunately, there is no easy way around this bias until and unless hyperlinks become bidirectional entities on the Web [9]. However we can assess the quality of the samples through other means. The graph is made regular by adding sufficient numbers of selfloops at each node; see §3.
We use a variant of BarYossef's walk, with random jumps thrown in with a small probability, which in our empirical experience gave us faster convergence. We call this the SAMPLING walk, whereas the PageRank walk with d = 0 is called the WANDER walk.
2.2 Taxonomy design and document classification
We downloaded from the Open Directory (http://dmoz.org) an RDF file with over 271,954 topics arranged in a tree hierarchy with depth at least 6, containing a total of about 1,697,266 sample URLs. Since the set of topics was very large and many topics had scarce training data, we pruned the Dmoz tree to a manageable frontier as described in §3.1 of our companion paper in these proceedings [10]. The pruned taxonomy had 482 leaf nodes and a total of 144,859 sample URLs. Out of these we could successfully fetch about 120,000 URLs.
For the classifier we used the public domain BOW toolkit and the Rainbow naive Bayes (NB) classifier created by McCallum and others [23]. Bow and Rainbow are very fast C implementations which let us classify pages in real time as they were being crawled. Rainbow's naive Bayes learner is given a set of training documents, each labeled with one of a finite set of classes/topics. A document d is modeled as a multiset or bag of words, {t, n(d,t)} where t is a term/word/token/feature. The prior probability Pr(c) is the fraction of training documents labeled with class c. The NB model is parameterized by a set of numbers q_{c,t} which is roughly the rate of occurrence of term t in class c, more exactly, (1+_{d Dc}n(d,t))/(T+_{d,t}n(d,t)), where D_{c} is the set of documents labeled with c and T is the entire vocabulary. The NB learner assumes independence between features, and estimates

3 Background topic distribution
In this section we seek to characterize and estimate the distribution of topics on the Web, i.e., the fractions of Web pages relevant to a set of given topics which we assume to cover all Web content.
We will use only SAMPLING walks here; WANDER walks will be used in §4. BarYossef et al.'s random walk has a few limitations. Apart from the usual bias towards high indegree node and nodes close to the starting point, it may find convergence elusive in practice if snared in densely linked clusters with a sparse egress. One example of such a graph is the ``lollipop graph'' which is a completelyconnected kclique with a `stick' of length k dangling from one node belonging to the clique. It is easy to see that a walk starting on the stick is doomed to enter the clique with high probability, whereas getting out from the clique to the stick will take a long, long time [26].
Unfortunately, lollipops and nearlollipops are not hard to find on the Web: http://www.amazon.com/, http://www.stadions.dk/ and http://www.chipcenter.com/ are some prominent examples. Hence we add a PageRankstyle random jump parameter to the original BarYossef algorithm [2], set to 0.010.05 throughout our experiments, i.e., with this probability at every step, we jump uar to a node visited earlier in the SAMPLING walk. Berg confirms [4] that this improves the stability and convergence of the SAMPLING walks.
3.1 Convergence
BarYossef et al. showed that the samples collected by a SAMPLING walk have degree distributions that converge quickly (within a few hundred distinct page fetches) to the ``global'' degree distribution (obtained from the Internet Archive). It would be interesting to see if and how convergence properties generalize to other page attributes, such as topics.
Our basic experimental setup starts with two URLs u_{0} and v_{0}, generally about very different topics. We now execute the following steps:
 Perform separate SAMPLING walks starting from each of them. In our implementation we do not model the selfloops for efficiency, so we get a what we call a ``physical walk'' where successive URLs are generally different.
 We estimate an upper bound to the maximum degree M of any node in the Web. We can pick a very loose upper bound, such as 10,000,000this will increase the number of selfloops, but have no effect on the efficiency or the quality of the sample. The selfloop probability of a node u with total (in+out) degree d_{u} is set to 1d_{u}/M.
 We turn the physical walk into a ``virtual walk'' by selflooping at each URL u for a random number of times, distributed as a geometric random variable with mean d_{u}/M. (See Figure 1.) A geometric random variable X with mean q has value x with probability Pr(X = x) = q(1q)^{x1}, for x = 1,2,.
 Ideally, we should use one SAMPLING walk for collecting each sample page (i.e., keep only the last page reached in every SAMPLING walk), but walking is expensive (mainly because of backlink queries which need to be polite to the search service). Therefore we pick a sufficiently large (virtual) stride k over which we hope ``memory is lost'' on the virtual walk, and collect URL samples every k hops from the virtual walk. We play with a range of ks to guard against too small a choice.
In our first experiment, we pick some 20 topics from our 482topic Dmoz collection and one representative URL from each topic as a starting point for a SAMPLING walk. Each page visited in a walk is classified using Rainbow and its class histogram as well as in and outneighbors stored in a relational database. Then we consider pairs of walks, turn them into virtual walks and sample at a given stride on the fly.
Suppose we have thus collected two sets of documents D_{1} and D_{2}. Each document d has a class probability vector p(d) = (Pr(cd)"c), where _{c}Pr(cd) = 1. E.g., if we have only two topics /Arts and /Sports, a document may be represented by the vector (0.9,0.1). The topic distribution of D_{1} (likewise, D_{2}) is simply the average.

We characterize the difference between p(D_{1}) and p(D_{2}) as the L_{1} difference between the two vectors, _{c}p_{c}(D_{1})p_{c}(D_{2}). This difference ranges between 0 and 2 for any two probability vectors over the classes. We had to avoid the use of the more wellknown KullbackLeibler (KL) divergence

BarYossef et al. found that an undirected random walk touching about 300 physically distinct pages was adequate to collect a URL sufficiently unbiased to yield a good degree distribution estimate. We use this number as a guideline to set the virtual hop size so that at least this number of physical pages will be skipped from one sample to the next.
Figure 2 shows pairwise class distribution differences starting from a few pairs of very dissimilar topics; the xaxis plots the number of samples drawn from the respective virtual walks, and different curves are plotted for different virtual hop sizes. We never seem to get complete convergence (distance zero) but the distance does rapidly reduce from a high of over 1 to a low of about 0.19 within 10001500 virtual hops, which typically includes about 300400 distinct physical pages. A larger virtual stride size leads to a slightly faster rate of convergence, but curiously, all pairs converge within the same ballpark number of virtual strides.
Convergence of this nature is a stronger property than just a decay in the similarity between u_{0} and u_{i} on a walk (u_{0},u_{1},) as i increases. It indicates that there is a welldefined background topic distribution and we are being able to approach it with suitably long SAMPLING walks.
Obviously, the rate at which the topic probability vectors converge depends on the granularity of the topic specification, if only because it will take a larger number of documents to fill up a larger number of topic buckets adequately to make a reliable reading. We repeat the above experiment with a coarser version of Dmoz which is obtained by lifting all the 482 leaf classes of our Dmoz topic set to their immediate parents. Figure 3 shows the results. Because topic bins are populated more easily now, the distance is already small to start with, about 0.2, and this decreases to 0.05. But the number of virtual hops required is surprisingly resilient, again within the 10001500 range.
There are some anomalous drops in distance at a small number of virtual hops in the graphs mentioned above. This is because we did not preserve the page contents during the walks but refetched them later for pages classification. Some page fetches at small hop count timed out, leading to the instability. For larger hop counts the fraction of timeouts was very small and the result became more stable.
In Figure 4, we show an estimate of the background distribution of Web pages into the 12 toplevel topics in our taxonomy. Computerrelated topics lead the show. This is completely understandable (if only because of rampant mirroring of software manuals).
3.2 The background distribution vs breadthfirst crawls
Many production crawlers follow an approximate breadthfirst link exploration strategy which gives them some basic robustness against overloading a small set of servers or going depthfirst into a bottomless ``spider pit''. The crawler which populates Alta Vista's Connectivity Servers follows largely a breadthfirst strategy [5,7]. Najork and Weiner [27] have demonstrated that a breadthfirst crawler also tends to visit nodes with large PageRank early (because good authorities tend to be connected from many places by short paths). A crawler of substantial scale deployed in NEC Research uses breadthfirst scheduling as well.
Level  Sampled  Unique  Fetched 
0  50000  49274  23276 
1  500000  491893  45839 
2  5000000  4857239  109544 
The NEC crawl was started from URLs taken from Dmoz. These URLs were placed in level zero. We collected URL samples from levels 0, 1 and 2. Details are shown in Figure 5. Figure 6(a) shows the pairwise distance between the three levels of the NEC crawl. The distances are fairly small, which indicates that the aggregate class distribution drifts quite slowly as one strikes out from the seed set. Therefore any significant bias in the seed set will persist for quite a few levels, until the frontier size approaches a sizable fraction of the reachable Web.
(a) 
 
(b) 

Figure 6(b) shows the distance of the NEC collections at the three levels and our approximation to the background topic distribution. We see some significant distance between NEC collections and the background distribution, again suggesting that the NEC topic distributions carry some bias from the seed set. The bias drops visibly from level 0 to level 1 and then rises very slightly in level 2. It would be of interest to conduct larger experiments with more levels.
Topics OVERrepresented in Dmoz 
compared to the background 
Games.Video_Games.Genres 
Society.People 
Arts.Celebrities 
Reference.Education.Colleges_and_Universities.North_America... 
Recreation.Travel.Reservations.Lodging 
Society.Religion_and_Spirituality.Christianity.Others 
Arts.Music.Others 
Reference.Others 
Topics UNDERrepresented in Dmoz 
compared to the background 
Computers.Data_Formats.Markup_Languages 
Computers.Internet.WWW.Searching_the_Web.Directories 
Sports.Hockey.Others 
Society.Philosophy.Philosophers 
Shopping.Entertainment.Recordings 
Reference.Education.K_through_12.Others 
Recreation.Outdoors.Camping 
3.3 Faithful representation of topics in Web directories
Many Web users implicitly expect topic directories to be a microcosm of the Web itself, in that pages of all topics are expected to be represented in a fair manner. Reality is more complex, and commercial interests play an important role in biasing the distribution of content cited from a topic directory. Armed with our sampling and classification system, we can easily make judgments about the biases in topic directories, and locate topics which are represented out of proportion, one way or another.
The L_{1} distance between the Dmoz collection and the background topic distribution is quite high, 1.43, which seems to indicate that the Dmoz sample is highly topic biased. Where are the biases? We show some of the largest deviations in Figure 7. In continuing work we are testing which among these are statistically significant. This is not a direct statement about DMoz, because our sample of DMoz differs from its original topic composition. Overrepresentation could be caused by our sampling, but the underrepresented topics are probably likewise underrepresented in DMoz. In any case, it is clear that comparisons with the background distribution gives us a handle on measuring the representativeness of topic directories.
3.4 Topicspecific degree distributions
Several researchers have corroborated that the distribution of degrees of nodes in the Web graph (and many social networks in general [16,34]) asymptotically follow a power law distribution [1,7,21]: the probability that a randomly picked node has degree i is proportional to 1/i^{x}, for some constant `power' x > 1. The powers x for in and outdegrees were estimated in 1999 to be about 2.1 and 2.7, respectively, though the fit breaks down at small i, especially for outdegrees [7]. Barabási and Albert [3] gave an early generative model called ``preferential attachment'' and an analysis for why millions of autonomous hyperlinking decisions distributed all over the Web could lead to a powerlaw distribution. Dill et al. [15] showed that subgraphs selected from the Web as per specific criteria (domain restriction, occurrence of a given keyword, etc.) also appear to satisfy powerlaw degree distributions. Pennock et al. [30] found that certain topicspecific subsets of the web diverge markedly from powerlaw behavior at small i, though still converge to a power law for large i; the authors explain these observed distributions using an extension of Barabási's model.
We conduct more general tests for power law behavior across a broad array of topics: if we fix a topic and measure the degree of nodes relevant to that topic, will the resulting degree distribution also follow a power law? We can use the same softcounting technique to answer this question. Using a SAMPLING walk, we derive a sample D of pages. If a page d has degree D_{d} and class vector p(d), it contributes a degree of D_{d} p_{c}(d) to class c. Note that D_{d} includes all links incident on d.
Figure 8 shows that topicspecific degree distributions also follow the power law over many orders of magnitude. We cannot explain this by claiming that ``social networking within a topic mimics social networking on the Web at large'', because D_{d} includes offtopic links, for which there is no known analysis of social networking processes.
Loglog plots of degree distributions for various topic look strikingly similar at first glance, but a closer examination shows that, for example, pages about Web directories generally have larger degree than pages about hockey, which matches our knowledge of the Web. We observe that the degree distribution restricted to members of a specific topic have a power law tail, but with a significant divergence from power law at small numbers of links, in agreement with the Pennock et al. findings [30], and in contrast to the global indegree distribution which is nearly a pure power law [7].
An empirical result of Palmer and Steffan [29] may help explain why we would expect to see the power law upheld by pages on specific topics. They showed through experiments that the following simple ``8020'' random graph generator fits powerlaw degree distributions quite well:
 Assume for simplicity that the number of nodes N is a power of 2, and M edges are desired.
 Partition the graph into two node sets V_{1} and V_{2} each of size N/2. Let fraction p_{ij} of edges go from some node in V_{i} to some node in V_{j}, i,j = 1,2. P = (p_{ij}) are provided as parameters. The idea is to favor intracommunity links by having p_{11} and p_{22} larger than p_{12} or p_{21}, hence the name ``8020''.
 Using P, pick one of the four possible types of edges.
 Recurse within the 1/4 of edges that are consistent with the above choice; continue recursing with the same parameters P until a specific edge is materialized (i.e., until the number of remaining nodes equals 1 or 2).
 Repeat until M edges are added to the graph.
4 Topical locality and linkbased prestige ranking
In this section we use the WANDER walk to see how fast the memory of the topic of a starting page fades as we take random forward steps along HREFs. (No backlinks or selfloops are permitted.) It is actually quite difficult in practice to sustain forward walks. Figure 9 shows that if we start a large set of WANDER walks, very few survive with each additional hop, owing to no outlink, broken outlinks, or server timeout. Note that this experiment is different from the study of the NEC crawler, because here, only one random outlink is explored from each page, whereas the NEC crawler tried to fetch as many outlinks in every subsequent level as possible.
4.1 Experiments and results
At this point we have a reference background topic distribution. We consider several classes in DMoz. Because forward walks die fast, we picked some Dmoz classes which were very wellpopulated, say with more than 10000 URLs. We start one WANDER walk from each URL. The starting pages are at distance 0. For each topic, we collect all the pages D_{i} found at distance i, i 0, and find the soft class histogram p(D_{i}) as before. Next we find the distance between p(D_{i}) and our precomputed baseline, as well as the distance between p(D_{i}) and p(D_{0}) to monitor the drift rate.
Figure 10 shows the results for four topics as starting points. Because all the starting points in each group were taken from a specific topic, the topic histogram at distance 0 is quite dissimilar from the background. Even 20 hops seem inadequate to bring the distribution closer to the background. But this is not because the walks stay perfectly ontopic. They clearly do start drifting away, but not too badly within the first 510 hops. Thus it is clear that a samplingtype walk is critical to topic convergence. The rate of drift away from the starting topic also varies from topic to topic: `Soccer' seems to be very driftprone whereas `Photography' drifts much less. Such prior estimates would be valuable to a focused crawler, and explain in part why focused crawling along forward links is already quite successful [11]. We also confirm our intuition that our estimate of drift w.r.t. broad topics seems generally lower than what Davison and Menczer have characterized in terms of cosine similarity with the starting point.
4.2 Implications
Two families of popularitybased ranking of Web pages have been very successful in recent years. Google (http://google.com) uses as a subroutine the PageRank algorithm [6,28], which we have already reviewed in §2.1. Kleinberg proposed the HITS algorithm [20] which has many variants.
Unlike PageRank, HITS does not analyze the whole Web graph, but collects a subgraph G_{q} = (V_{q},E_{q}) of the Web graph G in response to a specific query q. It uses a keyword search engine to collect a root set R_{q}, includes R_{q} in G_{q} and then further includes in G_{q} any node linked from or linking to some node in R_{q} (a radiusone link expansion). In matrix terms, we can reuse E_{q} to denote the node adjacency matrix: E_{q}(i,j) = 1 if i links to j, and 0 otherwise. HITS assigns two scores a(v) and h(v) with each node, reflecting its authority and its hubness (the property of compiling a number of links to good authorities), and solves the following mutually recursive equations iteratively, scaling a and h to unit norm every step:

It has been known that the radiusone expansion improves recall; good hubs and authorities which do not have keyword matches with the query may be drawn into G_{q} this way. Some irrelevant pages will be included as well, but our experiments confirm that at radius one the loss of precision is not devastating. Davison and Menczer [13,25] have proposed that topical locality is what saves G_{q} from drifting too much, but they did not use a reference set of topics to make that judgment.
Unlike HITS, PageRank is a global computation on the Web graph, which means it assigns a queryindependent prestige score to each node. This is faster than collecting and analyzing queryspecific graphs, but researchers have hinted that the queryspecific subgraphs should lead to more faithful scores of authority [20].
PageRank has a random jump parameter d, which is empirically set to about 0.150.2. This means that typically, every sixth to eighth step, the random surfer jumps rather than walks to an outneighbor, i.e., the surfer traces a path of typical length 68 before abandoning the path and starting afresh. The key observation is this: if topic drift is small on such short directed random paths, as Figure 10 seems to indicate, the global nature of the PageRank computation does not hurt, because endorsement to any node (apart from jumping to it uar, which treats all nodes equally) comes from a small neighborhood which is topically homogeneous anyway!
All this may explain why PageRank is an acceptable measure of prestige w.r.t. any query, in spite of being a global measure. Confirming this intuition will not be easy. Perhaps one can build synthetic graphs, or extract large graphs from the Web which span multiple topic communities, and tweak the jump probability d so that the average interval between jumps ranges from much less than directed mixing radius to much more that mixing radius, and see if the largest components of the PageRank vector leaves a variety of topic communities to concentrate in some dominant communities featuring ``universal authorities'' such as http://www.yahoo.com/, http://www.netscape.com/, http://www.microsoft.com/windows/ie/, or http://www.adobe.com/prodindex/acrobat/readstep.html.
5 Relations between topics
According to the ordinary graph model for Web pages, a link or edge connects two nodes which have only graphtheoretic properties such as degree. Given our interest in the content of pages, it is natural to extend our view of a link as connecting two topics. The topic of a potential link target page v is clearly the single most important reason why the author of another page u may be motivated to link from u to v.
This view is clearly missing from preferential attachment theory, where the author picks targets with probability proportional to their current indegree, regardless of content or topic. It would be very interesting to see a general, more realistic theory that folds some form topic affinity with preferential attachment and matches our observations.
We model topic affinity using a topic citation matrix, which is constructed using softcounting as follows:
 Suppose there are N topics at the desired level of detail. Initialize a N×N topic citation matrix C with all zeroes.

Repeat the following steps as long as the estimate C `improves'.
(E.g., if l links have been sampled, we want C/l to
converge.)
 Sample a page u nearly uar from the Web using the SAMPLING walk and the virtual hop technique.
 Sample an outlink v of u uar.
 Fetch and classify u and v, getting class probability row vectors p(u) and p(v).
 For every position (i,j) in C, increase C(i,j) by p_{i}(u) p_{j}(v). In matrix terms, assign C C + p(u)^{T} p(v).
(a) Raw citation  (b) Confusionadjusted 
5.1 Experimental results
We experimented with our 482leaf taxonomy at two levels of detail: the top level with 12 topics and the third level with 191 topics. This is partly because data at the 482×482 level was very sparse. In Figure 11(a), we show the 12class toplevel citation matrix. Dark colors (in the HTML version, hot colors) show higher probabilities. The diagonal is clearly dominant, which means that there is a great deal of selfcitation among topics. This natural social phenomenon explains the success of systems like HITS and focused crawlers. It is also worthwhile to note that the matrix is markedly asymmetric, meaning that communities do not reciprocate in crosslinking behavior.
Apart from the prominent diagonal, there are two horizontal bands corresponding to /Computers and /Society, which means that pages relevant to a large variety of topics link to pages about these two topics and their subtopics. Given that these are the two most dominant toplevel topics on the Web (Figure 4), it is conceivable that preferential attachment will lead to such behavior.
Truenormalized  Guessnormalized 
However, these inferences may be specious if it turns out that our classifier is biased in favor of /Computers and /Society when it guesses the class of test documents. To avoid this problem, we use a heldout labeled data set from DMoz to calibrate the classifier. The result is a confusion matrix E where E(i,j) is the number of documents in class i labeled with class j by the classifier. Depending on the application, we can scale either the row (true class) or the column (guessed class) of E to 1. In either case, an ideal classifier's confusion matrix will be the identity matrix. We show both scaled versions E_{t} and E_{g} in Figure 12. Although the color scales were designed for maximum contrast, we note that the diagonal elements are generally large, hinting that the classifier is doing well.
The entries corresponding to guessed class j in the ``guessnormalized'' version E_{g} may be regarded as the empirical Pr(true = iguess = j) for all i. Information from E_{g} can be folded into the soft counting process as follows. After we fetch and classify u and v, getting class probability row vectors p(u) and p(v), we find the corrected probability vector p^{*}(u) (similarly, p^{*}(v)) by computing the corrected probability that u `truly' belongs to class i as

Returning to the citation matrix, we present the corrected citation matrix in Figure 11(b). Generally speaking, the correction smears out the probability mass slightly, but the corrected data continues to show a higher than average rate of linkage to documents about /Computers and /Society.
We can now extend the experiments to more detailed levels in the taxonomy. Figure 13 shows the citation matrix at the third level of our DMoz sample, with 191 topics. This diagram is a drilldown into the earlier 12class citation diagram, and we see a telltale blockdiagonals and other blockstructure in the larger matrix, which align with the broad 12topic boundaries. (They are not all equally wide, because the toplevel topics have different numbers of descendants.)
The diagonal remains dominant, which is good news for topic communities and focused crawling. Finer horizontal lines emerge, showing us the most popular subtopics under the popular broad topics. Zooming down into /Arts, we find the most prominent bands are at /Arts/Music, /Arts/Literature and /Arts/Movies. Within /Computers, we find a deep, sharp line about a fourth of the way down, at /Computers/DataFormats. This is partly an artifact of badly written HTML which confounded our libxml2 HTML parser, making HTML tags part of the classified text. Other, more meaningful target bands are found at /Computers/Security, /Recreation/Outdoors, and /Society/Issues. We found many meaningful isolated hotspots, such as from /Arts/Music to /Shopping/Music and /Shopping/Entertainment/Recordings.
Because the `height' at various pixels is not very clear from a 2d rendering, we also show a 3d contourplot of the 191×191 citation matrix in Figure 14. Although the excessive smoothing and the slope introduced between adjacent pixels is artificial, the contourplot does reveal how strongly selfciting the topics are, even at the third level of DMoz.
5.2 Implications and applications
The topic citation matrix, similar in spirit to a spectrogram, is fascinating in its own right in showing the strongest (singlelink) intertopic connections on the Web, but it also has a variety of practical applications. We touch upon a few here.
Improved hypertext classification: Standard Bayesian text classifiers build a classconditional estimate of document probability, Pr(dc), in terms of the textual tokens appearing in d. Pages on the Web are not isolated entities, and the (estimated) topics of the neighbors N(u) of page u may lend valuable evidence to estimate the topic of u [8,19]. Thus we need to estimate a joint distribution for Pr(c(u),c(N(u))), which is a direct application of the topic citation matrix.
Enhanced focused crawling: A focused crawler is given a topic taxonomy, a trained topic classifier, and sample URLs from a specific topic. The goal is to augment this set of relevant URLs while crawling as few irrelevant pages as possible. Currently, focused crawlers use the following policies:
 If page u is relevant, an outlink v is also likely to be relevant [11].
 Given the text of a page u, we can estimate the link distance to a relevant goal node [14].
 Given features from near outlink (u,v) on page u, estimate the `reward' that may be accrued by following the link [10,31].
Reorganizing topic directories: We had to discard the /Regional and /News subtrees, not only because of severe classification error, but also because of overwhelming interlinkage between these and other topics. /News is almost always about some other topic, such as /Sports or /Science. The structure within the /Regional or even the /Shopping subtree mirrors many broad topics outside. E.g., mountain biking fits under recreation and sports, but to buy biking gear, one must move up and then descend down the /Shopping path. A tree is a rather inadequate structure to connect areas of human endeavor and thought, and such artificial structures have compelled Yahoo! to include a number of ``soft links'' connecting arbitrary points in their topic `tree'.
The limitations of a tree representation are responsible for many longrange offdiagonal elements in the topic citation matrix, and arguably makes directory browsing less intuitive. It may also make focused crawling and automatic cataloging more difficult. We claim that such phenomena warrant careful consideration of taxonomy inversion and better metadata annotation. E.g., let commercial interests related to biking be contained in the subtree dedicated to biking, rather than collect diverse commercial interests together. We can envisage advanced user interfaces through which taxonomy editors can point and click on a topic citation map or a confusion matrix to reorganize and improve the design of a taxonomy.
6 Concluding remarks
The geography of the Web, delineating communities and their boundaries in a state of continual flux, is a fascinating source of data for social network analysis (see, e.g., http://www.cybergeography.org/). In this paper we have initiated an exploration of the terrain of broad topics in the Web graph, and characterized some important notions of topical locality on the Web. Specifically, we have shown how to estimate the background distribution of broad topics on the Web, how pages relevant to these topics cite each other, and how soon a random path starting from a given topic `loses' itself into the background distribution.
We believe this work barely scratches the surface w.r.t. a new, contentrich characterization of the Web, and opens up many questions, some of which we list below.
PageRank jump parameter: How should one set the jump probability in PageRank? Is it useful to set topicspecific jump probabilities? Does an understanding of mixing radius help us set better jump probabilities? Is there a useful middle ground between PageRank's precomputed scores and HITS's runtime graph collection?
Topical stability of distillation algorithms: How can we propose models of HITS and stochastic variants that are contentcognizant? Can contentguided random walks be used to define what a focused crawler should visit and/or collect? Can we validate this definition (or proposal) on synthetic graphs? Can such a theory, coupled with our measurements of topic linkage, predict and help avoid topic drift in distillation algorithms?
Better crawling algorithms: Given that we can measure mixing distances and intertopic linkage, can we develop smarter federations of crawlers in which each concentrates on a collection of tightly knit topics? Can this lead to better and fresher coverage of small communities? Can we exploit the fact that degrees follow power laws both globally and locally within topic contexts to derive better, less topicbiased samples of URLs from the Web?
Acknowledgments: Thanks to Alice Zheng and David Lewis for helpful discussions, to Kim Patch for bringing Menczer's recent measurements to our notice, to Gary W. Flake for providing the NEC crawl data, and for his help with our code. We also thank the referees for helpful comments.
References
 [1]

L. A. Adamic and B. A. Huberman.
Powerlaw distribution of the World Wide Web.
Science, 287:2115a, 2000.
 [2]

Z. BarYossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz.
Approximating aggregate queries about Web pages via random walks.
In Proceedings of the 26th International Conference on Very
Large Databases (VLDB), pages 535544, 2000.
Online at
http://www.cs.berkeley.edu/~zivi/papers/webwalker/webwalker.ps.gz.
 [3]

A.L. Barabási and R. Albert.
Emergence of scaling in random networks.
Science, 286:509512, 1999.
 [4]

A. Berg.
Random jumps in WEBWALKER.
Personal communication, Apr. 2001.
 [5]

K. Bharat, A. Bröder, M. Henzinger, P. Kumar, and S. Venkatasubramanian.
The Connectivity Server: Fast access to linkage information on the
Web.
In 7th World Wide Web Conference, Brisbane, Australia, 1998.
Online at
http://www.research.digital.com/SRC/personal/Andrei_Broder/cserv/386.html and http://decweb.ethz.ch/WWW7/1938/com1938.htm.
 [6]

S. Brin and L. Page.
The anatomy of a largescale hypertextual web search engine.
In Proceedings of the 7th WorldWide Web Conference (WWW7),
1998.
Online at http://decweb.ethz.ch/WWW7/1921/com1921.htm.
 [7]

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata,
A. Tomkins, and J. Wiener.
Graph structure in the Web: Experiments and models.
In WWW9, pages 309320, Amsterdam, May 2000. Elsevier.
Online at http://www9.org/w9cdrom/160/160.html.
 [8]

S. Chakrabarti, B. Dom, and P. Indyk.
Enhanced hypertext categorization using hyperlinks.
In SIGMOD Conference. ACM, 1998.
Online at http://www.cs.berkeley.edu/~soumen/sigmod98.ps.
 [9]

S. Chakrabarti, D. A. Gibson, and K. S. McCurley.
Surfing the Web backwards.
In WWW, volume 8, Toronto, Canada, May 1999.
Online at http://www8.org.
 [10]

S. Chakrabarti, K. Punera, and M. Subramanyam.
Accelerated focused crawling through online relevance feedback.
In WWW, Hawaii, May 2002. ACM.
 [11]

S. Chakrabarti, M. van den Berg, and B. Dom.
Focused crawling: a new approach to topicspecific web resource
discovery.
Computer Networks, 31:16231640, 1999.
First appeared in the 8th International World Wide Web Conference,
Toronto, May 1999. Available online at
http://www8.org/w8papers/5asearchquery/crawling/index.html.
 [12]

T. M. Cover and J. A. Thomas.
Elements of Information Theory.
John Wiley and Sons, Inc., 1991.
 [13]

B. D. Davison.
Topical locality in the Web.
In Proceedings of the 23rd Annual International Conference on
Research and Development in Information Retrieval (SIGIR 2000), pages
272279, Athens, Greece, July 2000. ACM.
Online at http://www.cs.rutgers.edu/~davison/pubs/2000/sigir/.
 [14]

M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori.
Focused crawling using context graphs.
In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel,
G. Schlageter, and K.Y. Whang, editors, VLDB 2000, Proceedings of 26th
International Conference on Very Large Data Bases, September 1014, 2000,
Cairo, Egypt, pages 527534. Morgan Kaufmann, 2000.
Online at
http://www.neci.nec.com/~lawrence/papers/focusvldb00/focusvldb00.pdf.
 [15]

S. Dill, S. R. Kumar, K. S. McCurley, S. Rajagopalan, D. Sivakumar, and
A. Tomkins.
Selfsimilarity in the Web.
In VLDB, pages 6978, Roma, Sept. 2001.
Online at http://www.almaden.ibm.com/cs/k53/fractal.ps.
 [16]

M. Faloutsos, P. Faloutsos, and C. Faloutsos.
On powerlaw relationships of the internet topology.
In SIGCOMM, pages 251262, 1999.
Online at
http://www.cs.cmu.edu/~christos/PUBLICATIONS/sigcomm99.ps.gz.
 [17]

G. W. Flake, S. Lawrence, C. Lee Giles, and F. M. Coetzee.
Selforganization and identification of Web communities.
IEEE Computer, 35(3):6671, 2002.
See
http://www.neci.nec.com/~lawrence/papers/webcomputer02/bib.html.
 [18]

M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork.
On nearuniform URL sampling.
In WWW9, Amsterdam, May 2000.
Online at http://www9.org/w9cdrom/88/88.html.
 [19]

R. Jin and S. T. Dumais.
Probabilistic combination of content and links.
In SIGIR, pages 402403, New Orleans, sep 2001. ACM.
Online at
http://research.microsoft.com/~sdumais/SIGIR2001LinksRevisedSubmitted.pdf.
 [20]

J. Kleinberg.
Authoritative sources in a hyperlinked environment.
In Proc. ACMSIAM Symposium on Discrete Algorithms, 1998.
Also appears as IBM Research Report RJ 10076(91892), and online at
http://www.cs.cornell.edu/home/kleinber/auth.ps.
 [21]

S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
Trawling the web for emerging cybercommunities.
WWW8 / Computer Networks, 31(1116):14811493, 1999.
Online at
http://www8.org/w8papers/4asearchmining/trawling/trawling.html.
 [22]

D. D. Lewis.
ATTICS: A toolkit for text classification and text mining.
In M. Grobelnik, D. Mladenic, and N. MilicFrayling, editors,
KDD2000 Workshop on Text Mining, Boston, Aug. 2000. SIGKDD, ACM.
 [23]

A. McCallum.
Bow: A toolkit for statistical language modeling, text retrieval,
classification and clustering.
Software available from http://www.cs.cmu.edu/~mccallum/bow/,
1998.
 [24]

A. McCallum and K. Nigam.
A comparison of event models for naive Bayes text classification.
In AAAI/ICML98 Workshop on Learning for Text Categorization,
pages 4148. AAAI Press, 1998.
Also technical report WS9805, CMU; online at
http://www.cs.cmu.edu/~knigam/papers/multinomialaaaiws98.pdf.
 [25]

F. Menczer.
Links tell us about lexical and semantic Web content.
Technical Report Computer Science Abstract CS.IR/0108004, arXiv.org,
Aug. 2001.
Online at http://arxiv.org/abs/cs.IR/0108004.
 [26]

R. Motwani and P. Raghavan.
Randomized Algorithms.
Cambridge University Press, 1995.
 [27]

M. Najork and J. Weiner.
Breadthfirst search crawling yields highquality pages.
In WWW 10, Hong Kong, May 2001.
Online at http://www10.org/cdrom/papers/208.
 [28]

L. Page, S. Brin, R. Motwani, and T. Winograd.
The PageRank citation ranking: Bringing order to the Web.
Unpublished manuscript, online at
http://google.stanford.edu/~backrub/pageranksub.ps, 1998.
 [29]

C. Palmer and J. Steffan.
Generating network topologies that obey power laws.
In GLOBECOM, Nov. 2000.
Online at http://www.cs.cmu.edu/~steffan/items/globecom.ps.
 [30]

D. M. Pennock, G. W. Flake, S. Lawrence, C. L. Giles, and E. J. Glover.
Winners don't take all: Characterizing the competition for links on
the Web.
Proceedings of the National Academy of Sciences, 2002.
Preprint available
http://www.neci.nec.com/homepages/dpennock/publications.html.
 [31]

J. Rennie and A. McCallum.
Using reinforcement learning to spider the web efficiently.
In ICML, 1999.
Online at
http://www.cs.cmu.edu/~mccallum/papers/rlspidericml99s.ps.gz.
 [32]

P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles.
Methods for sampling pages uniformly from the world wide web.
In AAAI Fall Symposium on Using Uncertainty Within Computation,
pages 121128, 2001.
 [33]

G. Salton and M. J. McGill.
Introduction to Modern Information Retrieval.
McGrawHill, 1983.
 [34]

S. Wasserman and K. Faust.
Social Network Analysis: Methods and Applications.
Cambridge University Press, Cambridge, 1994.
Footnotes:
^{1}Contact author, email soumen@cse.iitb.ac.in
^{2}All
nodes have equal degree in a regular graph.
File translated from T_{E}X by T_{T}H, version 2.79.
On 19 Mar 2002, 20:22.