1 Introduction
In this paper, we present a detailed study of the following problem: given a small but cohesive "seed set" of web pages, expand this set to generate the enclosing community (or communities). This seed expansion problem has been addressed by numerous researchers as an intermediate step of various graphbased analyses on the web. To our knowledge, however, it has never been called out as an interesting primitive in its own right. As we will show, the mathematical structure underlying the problem is rich and fruitful, and allows us to develop algorithms that perform dramatically better than naive approaches.
The seed expansion problem first came into prominence in 2000, when Jon Kleinberg introduced the HITS algorithm [7]. That algorithm used a search engine to generate a seed set, and then performed a fixeddepth neighborhood expansion in order to generate a larger set of pages upon which the HITS algorithm was employed. This general recipe has since seen broad adoption, and is now a common technique for local linkbased analysis. Variants of this technique have been employed in community finding [8], in finding similar pages [1], in variants of HITS, PageRank [13], and TrustRank [5], and in classification [2]. More sophisticated expansions have been applied by Flake et al. [4] in the context of community discovery.
Alongside this body of work in web analysis, the theoretical computer science community has developed algorithms that find cuts of provably small conductance within a carefully expanded neighborhood of a vertex [12]. Intuitively, these methods are efficient because they make nonuniform expansion decisions based on the structure revealed during exploration of the neighborhood surrounding the vertex. Our results are based on these techniques, and we present modified versions of these methods and theorems that apply to the seed set expansion problem.
The actual computation is done by simulating a ``truncated'' random walk for a small number of steps, starting from a distribution concentrated on the seed set. In each step of the truncated walk, probability is spread as usual to neighboring vertices, but is then removed from any vertex with probability below a certain threshold. This bounds the number of vertices with nonzero probability, and implicitly determines which vertices are examined. After each step in the expansion process, we examine a small number of sets determined by the current random walk distribution, and eventually choose one of these sets to be the community for the seed.
1.1 Discussion of seed set expansion
The graphs we consider in our experiments have small diameter and a small average distance between nodes, so the number of nodes within a fixed distance of the seed set grows quickly. Nonetheless, these graphs contain distinct communities with relatively small cutsizes. We assume that the seed set is largely contained in such a community, which we refer to as a target community. In many of the graphs we consider, the seed is contained in a nested sequence of target communities. The results in section 2.5 show that if a good target community exists for the seed, then the random walk method produces some community that is largely contained within this target.
When expanding the seed with a truncated walk, a target community serves as a bottleneck, containing much of the probability for a significant number of steps. Since probability is removed from lowprobability vertices, this prevents the support of the walk from expanding quickly beyond the bottleneck. In contrast, expanding the seed set using a fixeddepth expansion entirely ignores the bottleneck defining the community. Some branch of the BFS tree is likely to cross the bottleneck and rapidly expand in the main graph before a large fraction of the nodes in the community have been reached. Thus a fixed depth expansion will be a bad approximation to the community, and might also be an impractically large ``candidate set'' for further processing.
The goal of examining a small number of nodes is clear enough, but we also need to pin down what we mean by a ``good'' community containing a given seed set. We will do this by describing the limitations of several possible ways to define good communities.
Several papers including [4] have defined a community to be a subgraph bounded by a small cut, which can be obtained by first growing a candidate set using BFS, and then pruning it back using ST maxflow. Unfortunately, a small seed set will often be bounded by a cut that is smaller than the cut bounding the target community, so the mininum cut criterion will not want to grow the seed set. Flake et al. address this problem by performing this process several times, adding nodes from the candidate set at each step to ensure expansion.
Another idea for ensuring a reasonable expansion of the seed set that is less ad hoc than most is to optimize for conductance instead of cutsize alone. Graph conductance (aka the normalized cut metric) is a quotientstyle metric that provides an incentive for growing the seed set. Unfortunately, the conductance score can be improved by adding barely related nodes (or even a disconnected component) to the seed set. We need to ensure that only nearby nodes are added to the seed set.
This raises a subtle but important point. While stronger parametric flow methods do exist for finding lowconductance cuts within an expanded neighborhood of the seed set, our walkbased method's weaker spectralstyle guarantee on conductance is counterbalanced by a valuable ``locality'' property which ensures that we output a community consisting of nodes that are closely related to the seed set. In practice we try to get the best of both worlds by cleaning up the walkbased cuts with a conservative use of flow that does not disturb this locality property very much. Experiments show that the results compare well with stronger optimization techniques.
The random walks techniques are remarkable because they produce communities with conductance guarantees, yet can be computed locally rather than on the whole graph, often while touching fewer nodes than BFS would. The best lowconductance cuts are more likely than minimum cuts to nontrivially expand the seed set, but still ensure a small boundary defining a natural community. Finally, in a certain sense that we will discuss in section 2.5, the added nodes are close to the seed set.
2 Random walk methods
As part of their work on graph partitioning and graph sparsification [12], Spielman and Teng present a method for finding cuts based on the mixing of a random walk starting from a single vertex. In that paper, the method is used as a subroutine to produce balanced separators and multiway partitions.
In this section, we describe how to use the techniques developed by SpiemanTeng to find communities from a seed set, and describe how the size and quality of the seed set affects the results. In sections 2.12.3 we present background on random walks and sweeps. In section 2.4 we show that any seed set that makes up a significant fraction of a target community will produce good results with our seed expansion method, and that most seed sets that are chosen randomly from within a target community will also produce good results. In section 2.5 we present a strippeddown version of the local partitioning techniques developed by SpielmanTeng, and describe quantitatively how the size and quality of the seed set affect the running time and guarantees. In section 2.6 we describe the truncation method used by SpielmanTeng, and describe the heuristics and practical modifications we have employed to keep the running time and total number of vertices examined small.
2.1 Notation
The results in this section apply to graphs that are unweighted
and undirected. Let denote
the adjacency matrix of the graph under consideration, and let
be the diagonal
matrix where
, the
degree of the th vertex.
The volume of a set of vertices is The edge border is denoted
the cutsize is denoted , and the conductance of a set of vertices is
This definition of conductance should not be confused with the conductance associated with a particular random walk on the graph. In particular, the conductance associated with a lazy random walk is a factor of 2 smaller.
2.2 Lazy random walks and sweeps
Given a seed set for
which we hope to find a community, we begin with the
probability distribution
, where
We then simulate several steps of a lazy random walk, computing the probability distributions , where
and where is the lazy random walk transition matrix
If the graph is connected, then converges to the stationary distribution . We are not interested in this limiting distribution, but rather in the distributions obtained after a small number of walk steps. We compute for all up to some specified time . After each step, we sort the vertices in descending order according the degreenormalized probabilities
letting be the th vertex in this order, so that
This ordering defines a collection of sets , where , and is the number of vertices with nonzero values of . The cutsizes, volumes, and conductances for every set can be computed in time proportional to , by determining the change to due to the addition of vertex . This process is referred to as a sweep. We will show that under certain conditions one of the sweep sets will be a good community for the seed . To make the degreenormalized distribution more intuitive, one can view each vertex as consisting of minivertices, two minivertices and for each undirected edge . The vector can be interpreted as a probability distribution on minivertices by letting . During the th step of the lazy walk, is the amount of probability sent from to . To bound the mixing of the lazy random walk, we consider an ordering of the minivertices so that . Using the shorthand notation , we then define
and the related quantity
The functions and provide a strong way to bound how well the walk has mixed; is the maximum amount of probability on any set of minivertices at time , and thus the amount of probability that flows over any set of edges at time is at most . The maximum amount of probability on a set of vertices is at most
and the distance between and the stationary distribution can be written
We will need the following monotonicity lemma, which is a corollary of the result of Lovsz and Simonovits [10].
2.3 Mixing of lazy random walks
For large seed sets, the initial distribution may already be
moderately well mixed, which will lead to better bounds on . In
particular, we have the following lemma.
where .
The following lemma due to Spielman and Teng is a strengthening of a result of Lovsz and Simonovits [10] [11]. It shows that the lazy random walk mixes well unless one of the sweeps sets has both small conductance and large drop in degreenormalized probability between the vertices inside and outside of .
(1) 
 , and
 .
2.4 Good seed sets
Lemma 3
shows that in the absence of a good sweep cut, the walk mixes
rapidly. If we can show that the walk does not mix as rapidly
as the lemma should imply, then one of the sweep cuts must be
good. One way to do this is to present a target community that has small
volume, but contains a large amount of probability from , and use this
to place a lower bound on
. The amount of probability
that has escaped from after steps,
which we will write
,
depends on both the target community and the seed set
. For every set with small
conductance, there are a variety of seed sets for which
is
not much larger than
. To
motivate this, consider the amount of escaping probability when
the seed set is the entire target set . The following
lemma is also due to SpielmanTeng.
Any set that is fairly large and nearly contained in the target community is also a good seed set. The bounds in the lemma below are weak for smaller seed sets, but are sufficient for seed sets that make up a significant fraction of the target community.
Seed sets chosen randomly from within a target community are also likely to be good seed sets for that community. The following result is stated without proof, since it follows by applying a Chernofftype bound to a weighted sum of independent random variables in the usual way.
holds with probability at least
2.5 Communities from a seed set
The following is a strippeddown version of a theorem of SpielmanTeng, extended for a seed set instead of a single starting vertex.
 , and
 ,
 , and
 .
The sweep cut can be viewed as a community of the seed . In addition to having small conductance, there is a significant drop in probability outside by Theorem 1. Figure 3 shows an example of this simultaneous small conductance cut and probability drop.
While these guarantees on are fairly clear, the way that the community is related to the seed set is more subtle. The most obvious thing we can say about this relationship is probably the most important: the community is obtained by taking all vertices where the value of is above a certain threshold. If we view as a measure of closeness to the seed set, this is a strong guarantee.
Unfortunately, is not a good measure of closeness to the seed set when becomes large. As tends to infinity, tends to a constant, and the ordering of the vertices determined by tends to the same ordering determined by the secondlargest eigenvector of , which is independent of the seed set if the graph is connected. This means that the vertices from the seed set are not guaranteed to have the largest values of , and so the seed set is not necessarily contained in the resulting community. However, we have an upper bound on the number of walk steps taken, and we claim that is a good measure of relationship to the seed set for small values of . In practice, we have observed that requiring the community to be determined by a threshold on the values of rules out more degenerate behavior than requiring only that the community contains the seed.
We cannot hope that the resulting community will closely match the target community , without placing additional assumptions on the communities inside . Instead, the target community is used to ensure that some community for will be found, and this community is likely to be mostly contained within the target community. This can be made precise if a stronger restriction is placed on . SpielmanTeng showed that if is roughly , then the gap in probability described in Theorem 1 can be used to show that the majority of vertices in are contained in .
2.6 Truncation
The actual subroutine created by SpielmanTeng, called Nibble, has a stronger guarantee than the method we have presented here. One aspect of Nibble that we would like to reproduce is the ability to find a small community near the seed set while keeping the number of vertices examined small. This is accomplished by using truncated walk distributions in place of exact walk distributions.
In the truncated walk, the probability on any vertex where is set to 0 after each step. This ensures that the support of the truncated walk is small, which gives a bound on the running time of the algorithm, since the time required to perform a step of the random walk and a sweep depends on . For the theoretical guarantees to apply, the truncation parameter must be small enough so that is smaller than the probability gap implied by Theorem 1.
In our experiments we perform a more severe kind of truncation, at a stronger level than is supported by these guarantees. We specify a target volume , sort the vertices in terms of for the current truncated probability vector , and let be the last index such that . We then remove the probability from all vertices where . In section 3.6 we will experimentally show that a target cluster can be accurately found by setting somewhat larger than the cluster size.
3 Experiments on three datasets
3.1 From theory to experiment
The previous sections contained theorems proving that given a sufficiently distinct target set (a subgraph with small enough conductance) and a sufficiently ``good'' seed set lying mostly within it, a lazy local random walk will grow the seed set and return a lowconductance community for the seed set. We also proved that good seed sets are not rare, and that a big enough or random subset of the target set should be good.
Due to constant factors, it is often hard to tell whether a theoretical method will work in practice. Therefore in the following subsections we will describe sanity check experiments on three datasets in which we used a sparse matrix viewer to choose target sets and seed sets that appeared likely to satisfy the preconditions of the theorems, so that we could check whether the method actually recovered the target sets with reasonable accuracy.
3.2 Cut improvement with flow
There are several reasons for using flow to clean up the sweep cuts. First, it is generally a good idea to do this for any spectraltype graph partitioning method [9]. Second, aggressive probability truncation can produce cuts that are even noisier than usual. Third, in the following sections we will see empirical evidence that flowbased improvement makes the sweep cut method less sensitive to the number of walk steps that are taken.
However, this postprocessing needs to be done with care; in some preliminary experiments in which we grew well past the size of the target set and then used parametric flow to shrink back to the cut with optimal conductance of any cut containing the seed set, we sometimes got surprising solutions containing distant, essentially unrelated nodes. Those results were a clear demonstration that conductance and containment of the seed set do not entirely capture our goals in seed set expansion. We actually want to find a lowconductance set that is obtained by adding nearby nodes to the seed set. Back in section 2.5 we argued that if a set of nodes contains a lot of probability from a random walk that hasn't run for too many steps, then it has a version of the desired locality property.
To improve the sweep cuts without sacrificing the locality property, we did some conservative postprocessing with repeated ST max flow calculations across a sliding window covering 10 percent of the total probability.^{1} These calculations were free to rearrange the lower probability nodes near a bottleneck, but not the higher probability nodes near the seed set, thus allowing us to reduce the cutsize without risking the loss of too much enclosed probability. The plots in Figure 1 show that our actual losses of enclosed probability were typically at least an order of magnitude smaller than the cutsize improvement factors.

3.3 .GOV Graph, KSC Seedset
The starting point for this graph was TREC's wellknown .GOV web dataset. To make the graph's block structure more obvious (so that we could visually choose a seed set that probably satisfies the requirements of the theory), we repeatedly deleted all nodes with in or outdegree above 300 or equal to 1 or 2. Then we symmetrized the graph and extracted the largest connected component, yielding an undirected graph with 536560 nodes and 4412473 edges, and node degrees ranging from 3 to 394.
After inspecting this graph's adjacency matrix (see Figure 2), we chose for our target cluster a distinct matrix block containing roughly 3000 URL's associated with the Kennedy Space Center. Among these URL's were 830 beginning with science.ksc.nasa.gov/shuttle. We randomly selected 230 of these nodes as a seed set. Jumping ahead to our results, Table 1 shows that by growing this seed set we can recover nearly all of the science.ksc.nasa.gov/shuttle URL's, plus large fractions of several other families of URL's that relate to the Kennedy Space Center.
Figure 3top shows the probability distribution after 60 steps of a lazy random walk starting with uniform probability on the seed set and zero probability elsewhere. Along the x axis nodes are sorted in order of decreasing degreenormalized probability, which we call ``sweep order''. On the left are the high probability nodes that are close to the seed set. Lower probability nodes appear towards the right of the plot and beyond. The sharp decrease in probability around 25003000 nodes is the signature of the lowconductance bottleneck that we are interpreting as a cluster boundary.
Each curve in Figure 3bottom shows a ``cutsize sweep'' in which we evaluate all cuts between adjacent nodes in the sweep order defined by the probability distribution at some step in the local random walk.^{2} The two curves here are for 30 and 90 steps. Notice that both sweeps show a dip in cutsize around 25003000 nodes, which again reveals the bottleneck at the cluster boundary.
Comparing the 30 and 90step sweeps, we see that the 30step sweep looks better towards the left, while the 90step sweep looks better towards the right. This is because the spreading probability is revealing information about larger sets later. We also see that the flowimproved solutions for 30 and 90 steps are nearly the same; apparently the flowbased postprocessing helps us get good results for a wider range of stopping times.3.4 Sponsored Search Graph, Betting Seedset
This task illustrates what can happen when cluster nesting causes there to be more than one reasonable community for the seed set. The data originated in Yahoo's ``sponsored search'' business (Google's ``ad words'' is roughly similar). In this business advertisers make bids on (bidded phrase, advertiser URL) pairs, which can be encoded in a bipartite graph with bidded phrases on one side and advertiser URL's on the other, with each bid represented by an edge. Coclustering this graph reveals block structure showing that the overall advertising market breaks down into numerous submarkets associated with flowers, travel, financial services, etc. For our experiment we used a 2.4 million node, 6 million edge bipartite graph built from part of an old version of the sponsored search bid database. Node degrees ranged from 1 to about 30 thousand.
Based on earlier experiences we expected that this dataset would contain a distinct gambling cluster which would contain a subcluster focused on sports betting (see the matrix picture in Figure 4). For a seed set we used grep to find all bidded phrases and advertiser URL's containing the substring ``betting''. This resulted in a 594node seed set which includes nodes on both sides of the bipartite graph, and is probably mostly contained within the sports betting subcluster.
Our seed set lies within the sports betting cluster, and this in turn lies within the gambling cluster. To which of these clusters should the seed set be expanded? Perhaps the algorithm should tell us about both of the possible answers. In fact, when there are nested clusters surrounding a seed set, over time the local random walk method often does tell us about more than one possible answer as the probability distribution spreads out.
This behavior can be very clearly seen in the probability distributions plotted in Figure 5top. At 15 steps there is a sharp decrease in probability marking the boundary of the 3000node betting subcluster. At 180 steps there is a different sharp decrease in probability marking the boundary of the enclosing 8000node gambling cluster. Notice that within this larger cluster the probabilities are becoming quite uniform at 180 steps, so it is no longer easy to see the boundary of the inner cluster.
The cutsize sweeps of Figure 5bottom show what all this means in terms of cuts. At 15 steps the boundary of the inner cluster is sharply defined by a dip in cutsize near 3000 nodes. Also at this point we are starting to get a rough preview of the boundary of the enclosing cluster. This preview is much improved by the flowbased postprocessing.
At 180 steps, the inner dip in cutsize delimiting the betting subcluster has been washed out (but can be somewhat recovered with flow) but now we have a good dip near 8000 nodes which is the boundary of the outer gambling cluster.
3.5 Movie/Actress Graph, Spain Seedset
This task has even more cluster nesting than the previous one. The graph is bipartite and was built from the IMDB file relating movies and actresses. It has a clear block structure that is strongly correlated with countries.^{3} Therefore we use countryofproduction labels for the movies as cluster membership labels for purposes of choosing seed sets and measuring precision and recall.
We cleaned up the problem a bit by deleting all multicountry movies and many nonmovie items such as television shows and videos. We combined several nearly synonymous country labels (e.g. USSR and Russia) and then deleted all but the top 30 countries. Finally we deleted all degree1 nodes and extracted the largest connected component. We ended up with a bipartite graph with 77287 actress nodes, 121143 movie nodes, and 566756 edges. The minimum degree was 2, and the maximum degree was 690.
For a target cluster we chose Spain, whose matrix block can be seen in Figure 10. It appears to part of a supercluster containing other Spanish and Portugueselanguage countries. One can also see many cross edges denoting appearances of Spanish actresses in French and Italian movies, and vice versa, so a Romancelanguage supercluster would not be surprising. For a seed set we randomly selected 5 percent of the movies produced in Spain. This seed set contained 179 movie nodes, and 0 actress nodes.
The curves in Figure 6 show cutsize sweeps based on the probability distribution after 40 and 210 walk steps. The 40step sweep (open circles) contains a sharp dip at 6500 nodes which marks the boundary of the Spain cluster with high accuracy. Beyond this bottleneck the 40step sweep is pretty sketchy, but the flowbased cuts derived from it (filled circles) are already surprisingly good.
The open triangle curve in Figure 6 shows that at 210 steps, the random walk has already washed out the boundary of the Spain cluster, but is now giving a better view of the enclosing hierarchy of four superclusters, especially those containing 9000 and 17500 nodes. The filled triangles show that flowbased postprocessing recovers the washedout boundary of the Spain cluster, and greatly improves the boundaries of the 27000 and 45000 node superclusters.
Because we have country labels for the movies we can interpret these clusters, and also measure the accuracy with which they are being recovered. The following table contains precision and recall measurements for the movies^{4} in the sets of nodes delimited by the (flowimproved) leftmost, middle, and rightmost dips in Figure 6. These precision and recall measurements show that the 6500node set is a good approximation of our target cluster of Spain. The 17500node set (which has the smallest conductance) is a very good approximation of a mostly Spanishlanguage supercluster containing Spain, Mexico, and Argentina, plus Portugal. The 45000node set is a good approximation of a Romancelanguage supercluster containing the previous 4 countries plus Brazil, Italy, and France.
quotient = cutsize / nodes  preci.  recall  cluster contents 
0.255835 = 1666 / 6512  0.956  0.979  Spain 
0.109531 = 1910 / 17438  0.979  0.995  Spanish+Portugal 
0.129574 = 5821 / 44924  0.958  0.988  Romance language 
3.6 Sweeps from Truncated Walks
In the experiments up to now we have done full walks with probability spreading to the whole graph. The computation can be made more local by doing a truncated walk, as discussed in section 2.6. Here we briefly examine the qualititative effect of truncating to a fixed number of nodes after each step. Figure 7 compares cut sweeps over node orderings derived from truncated and untruncated walks on the sponsored search and movies/actresses graphs. In the first two plots we truncated to the top 10000 nodes after each step. Up through the bottlenecks near 8000 and 6500 nodes, respectively, these new sweeps look similar to the earlier sweeps over full walks. In the final plot we truncated to 6000 nodes on the movies/actresses task, which is a bit too small for recovering the 6500node Spain cluster. In this case the results are degraded, but somewhat recoverable with flowbased postprocessing.
4 Fast implementation
One of the main features of the truncated random walks partitioning method is that its time and space requirements can be independent of the graph size. We will describe some code that basically achieves this time independence. [However, it still uses working space.] Our data structures, and the algorithm for one iteration are sketched in Figure 8. The graph contains nodes and edges, and is representable in CSR format in (n+1)*4 + 2*m*4 bytes.
We use five working arrays. is a scratch array for probabilities containing doubles, mostly zero. is a scratch array containing flag bytes, mostly zero. Together they occupy bytes.^{5} We use parallel arrays of doubles and node_id's to compactly represent small sets of nonzero values associated with ``active'' nodes. ``Maxlive'' is an upper bound on the number of active nodes. and are two arrays of node_id's each consuming bytes. is an array of values (degreenormalized probabilities) consuming bytes.
The key to speed and scalability is to never do any time operation, including clearing arrays or , except at program initialization time. While expanding a seed set, we only do operations whose time depends on the number of active nodes.
Push_probability
works as follows. The input represents the currently
active nodes. We zero the counter . The arrays and are
already zeroed. For each active node we push half of its
probability to itself, while the other half is evenly divided
and pushed to its neighbors. To push some probability to node , we do:
{
+; if (0=) {
+}}.
We end up with
nonzero probabilities scattered in . The entries in
tell where they
are, so we can gather them back into , simultaneously
clearing the nonzeros in and .
We now describe some timings on two large graphs. The 191million node graph IM_full is a symmetrized version of the buddy list graph for users of Yahoo Instant Messenger. We used an outofcore reimplementation of Metis to partition this graph into 8 big pieces, one of which is the 22million node graph IM_eighth. We also used Metis [6] to break both graphs into numerous tiny pieces of size 150200. For each graph, we randomly selected 100 of these tiny pieces to serve as seed sets for 100 runs of our truncated random walk procedure. Each run went for 100 iterations. We used volumebased truncation, on each step cutting back to a volume (sum of degrees) of 20000. On average, the probability pushing step grew the active set from about nodes out to about nodes, then the truncation step cut it back to about nodes.
Figure 9 summarizes these graphs and runs, which were done on a 4processor 2.4 GHz Opteron with 64 Gbytes of RAM. The average time per iteration was only 9.2 and 13.5 msec for the two graphs. Iterations on IM_full took 1.5 times longer, even though the graph is 8 times bigger. We actually hoped that the iteration time would be the same. A possible explanation for the increase is that there were more cache misses for the bigger graph (the probability pushing step involves a pattern of ``random'' memory accesses). To put these times into perspective, we also list some ``bogus'' times where we actually checked our invariant that the C and D arrays are zeroed at the top of the loop. This checking takes time, resulting in a huge increase in time per step, and a factor of 8 increase from the smaller to larger graph.
4.1 Choosing Parameters
So far we haven't said much about how to choose values for the parameters (the number of walk steps), and (the truncation size). It is clear from the previous sections that a seed set can have several possible communities; to choose between them we would need to retain some freedom in choosing parameter values. According to Theorem 1, it suffices to set to be roughly , where is a lower bound on the desired conductance and is an upper bound on the expansion factor . [We note that the method often works with much smaller values for .]
One can obtain an algorithm that depends on a single parameter by choosing a value of and searching through several values of , setting , and . The table below shows the results of such a search on the movies/actresses task, with set to , and with set to for each between 5 and 10. For each value of , we return the cut with the smallest conductance found by any of the sweeps. The following table reports the conductance and volume of this cut, and also the step at which it was found.
conductance  volume  t  
.1280  27621  480  
.0452  37499  250  
.0240  103720  700  
.0193  103894  280  
.0162  127374  620  
.0129  835500  780 
Bibliography

 1

Krishna Bharat and Monika R. Henzinger.
Improved algorithms for topic distillation in a hyperlinked environment.
In ACM SIGIR98, pages 104111, Melbourne, AU, 1998.  2

Soumen Chakrabarti, Byron E. Dom, and Piotr
Indyk.
Enhanced hypertext categorization using hyperlinks.
In Laura M. Haas and Ashutosh Tiwary, editors, Proceedings of ACM SIGMOD98, pages 307318, Seattle, US, 1998. ACM Press, New York, US.  3

Fan Chung and Lincoln Lu.
Connected components in random graphs with given degree sequences.
Annals of Combinatorics, 6:125145, 2002.  4

Gary Flake, Steve Lawrence, and C. Lee Giles.
Efficient identification of web communities.
In Sixth ACM SIGKDD, pages 150160, Boston, MA, August 2023 2000.  5

Zoltn Gyngyi, Hector GarciaMolina, and Jan
Pedersen.
Combating web spam with trustrank.
In VLDB, pages 576587, 2004.  6

George Karypis and Vipin Kumar.
A fast and high quality multilevel scheme for partitioning irregular graphs.
SIAM Journal on Scientific Computing, 20:359  392, 1999.  7

Jon M. Kleinberg.
Authoritative sources in a hyperlinked environment.
Journal of the ACM, 46(5):604632, 1999.  8

Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and
Andrew Tomkins.
Trawling the Web for emerging cybercommunities.
Computer Networks, 31(1116):14811493, 1999.  9

Kevin J Lang.
Fixing two weaknesses of the spectral method.
In NIPS, 2005.  10

Lszl Lovsz and Mikls Simonovits.
The mixing rate of markov chains, an isoperimetric inequality, and computing the volume.
In FOCS, pages 346354, 1990.  11

Lszl Lovsz and Mikls Simonovits.
Random walks in a convex body and an improved volume algorithm.
Random Struct. Algorithms, 4(4):359412, 1993.  12

Daniel A. Spielman and ShangHua Teng.
Nearlylinear time algorithms for graph partitioning, graph sparsification, and solving linear systems.
In ACM STOC04, pages 8190, New York, NY, USA, 2004. ACM Press.  13

M. Toyoda and M. Kitsuregawa.
Creating a web community chart for navigating related communities, 2001.
Footnotes
 ... probability.^{1}
 A different rule of thumb that is more motivated by the theory is that the sweep cut can be safely improved using a window in which the number of ``active'' vertices to the left and right of be such that the amount of active volume on each side is .
 ... walk.^{2}
 It might seem confusing that our sweep plots show cutsize while the random walks algorithm optimizes conductance. However, conductance is a particular way of combining our twin goals of 1) growing the seed set and 2) finding a small boundary. These plots of cutsize as a function of node count display a range of possible tradeoffs between these two goals.
 ... countries.^{3}
 Also many countries contain subblocks delimited by disruptive events like the silent / talkie transition, or WWII.
 ... movies^{4}
 but not the actresses, for which we lack country labels.
 ... bytes.^{5}
 Note that we could get below working space by using hash tables for and . An even fancier program could use hash tables to store the relevant part of the graph, with the full graph stored on disk or on a remote server.