The PageGather Algorithm

In this section, we introduce PageGather. Given a large access log, our task is to find collections of pages that tend to co-occur in visits. Clustering (see [22,16,24]) is a natural technique to consider for this task. In clustering, documents are represented in an N-dimensional space (for example, as word vectors). Roughly, a cluster is a collection of documents close to each other and relatively distant from other clusters. Standard clustering algorithms partition the documents into a set of mutually exclusive clusters.

Cluster mining is a variation on traditional clustering that is well suited to our task. Instead of attempting to partition the entire space of documents, we try to find a small number of high quality clusters. Furthermore, whereas traditional clustering is concerned with placing each document in exactly one cluster, cluster mining may place a single document in multiple overlapping clusters. The relationship between traditional clustering and cluster mining is parallel to that between classification and data mining as described in [20]. Segal contrasts mining ``nuggets'' -- finding high-accuracy rules that capture patterns in the data -- with traditional classification -- classifying all examples as positive or negative -- and shows that traditional classification algorithms do not make the best mining algorithms.

The PageGather algorithm uses cluster mining to find collections of related pages at a web site, relying on the visit-coherence assumption. In essence, PageGather takes a web server access log as input and maps it into a form ready for clustering; it then applies cluster mining to the data and produces candidate index-page contents as output. The algorithm has five basic steps:

1. Process the access log into visits.
2. Compute the co-occurrence frequencies between pages and create a similarity matrix.
3. Create the graph corresponding to the matrix, and find maximal cliques (or connected components) in the graph.
4. Rank the clusters found, and choose which to output.
5. For each cluster, create a web page consisting of links to the documents in the cluster, and present it to the webmaster for evaluation.
We discuss each step in turn.

1. Process the access log into visits. As defined above, a visit is an ordered sequence of pages accessed by a single user in a single session. An access log, however, is a sequence of page views, or requests made to the web server. Each request typically includes the time of the request, the URL requested, and the machine from which the request originated. For our purposes, we need to extract discrete visits from the log. We first assume that each originating machine corresponds to a single visitor. A series of page views in a day's log from one visitor, ordered by their time-stamps, then corresponds to a single session for that visitor.
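For concreteness, the following is a minimal Python sketch of this step; the (timestamp, url, machine) layout of log entries, and processing one day's log at a time, are assumptions for illustration rather than details of the original system.

    from collections import defaultdict

    def log_to_visits(log_entries):
        # Each entry is assumed to be a (timestamp, url, machine) tuple.
        # Sorting puts each machine's requests in time order; we treat
        # each originating machine as one visitor, and one visitor's
        # time-ordered page views for the day as a single visit.
        visits = defaultdict(list)
        for timestamp, url, machine in sorted(log_entries):
            visits[machine].append(url)
        return list(visits.values())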

2. Compute the co-occurrence frequencies between pages and create a similarity matrix. For each pair of pages p1 and p2, we compute P(p1|p2), the probability of a visitor visiting p1 if she has already visited p2, and P(p2|p1), the probability of a visitor visiting p2 if she has already visited p1. The co-occurrence frequency between p1 and p2 is the minimum of these two values.

We use the minimum of the two conditional probabilities to avoid mistaking an asymmetrical relationship for a true case of similarity. For example, a popular page p1 might be on the most common path to a more obscure page p2. In such a case P(p1|p2) will be high, perhaps leading us to think the pages similar. However, P(p2|p1) could be quite low, as p1 is on the path to many pages and p2 is relatively obscure.
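A minimal sketch of this computation, estimating each conditional probability as a fraction of visit counts:

    import itertools
    from collections import Counter

    def cooccurrence(visits):
        # page_count[p]: number of visits containing page p.
        # pair_count[(p1, p2)]: number of visits containing both pages.
        page_count, pair_count = Counter(), Counter()
        for visit in visits:
            pages = sorted(set(visit))
            page_count.update(pages)
            pair_count.update(itertools.combinations(pages, 2))
        similarity = {}
        for (p1, p2), both in pair_count.items():
            similarity[(p1, p2)] = min(both / page_count[p2],  # P(p1|p2)
                                       both / page_count[p1])  # P(p2|p1)
        return similarity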

As stated above, our goal is to find clusters of related but currently unlinked pages. Therefore, we wish to avoid finding clusters of pages that are already linked together. We prevent this by setting the matrix cell for two pages to zero if they are already linked in the site.

We observed that the similarity matrix could be viewed as a graph, which enables us to apply graph algorithms to the task of identifying collections of related pages. However, a graph corresponding to the similarity matrix would be completely (or almost completely) connected. In order to reduce noise, we apply a threshold and remove edges corresponding to low co-occurrence frequency.
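A sketch of this thresholding, combined with the zeroing of already-linked pairs described above; the set of linked page pairs is an assumed input:

    from collections import defaultdict

    def build_graph(similarity, linked, threshold):
        # linked is assumed to hold sorted (p1, p2) pairs that are
        # already joined by a hyperlink at the site; their cells are
        # treated as zero, and edges whose co-occurrence frequency
        # falls below the noise threshold are dropped.
        graph = defaultdict(set)
        for (p1, p2), freq in similarity.items():
            if (p1, p2) in linked or freq < threshold:
                continue
            graph[p1].add(p2)
            graph[p2].add(p1)
        return graph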


  
Figure 2: (a) A candidate cluster to be presented to the webmaster for approval and naming. (b) How the final page would appear at the site, properly named and formatted.


  
    Rank  Topic                        Coherent?  Useful?
    ----  --------------------------   ---------  --------
     1    Images of Korg keyboards     yes        somewhat
     2    Images of Korg keyboards     somewhat   somewhat
     3    Sequencers                   yes        yes
     4    Sequencers                   yes        yes
     5    Roland Instruments           no         no
     6    Samplers                     yes        yes
     7    Instrument samples           yes        somewhat
     8    Samplers                     yes        yes
     9    Instrument samples           yes        somewhat
    10    Books, do-it-yourself info   somewhat   yes
Figure 3: Clusters found by PGCLIQUE with overlap reduction. Clusters are ordered as ranked by PageGather. Topics are provided by the webmaster based on the cluster contents. All clusters are subjectively rated by the webmaster for the following qualities: coherence is the degree to which all of the cluster contents correspond to the apparent topic; usefulness rates whether the topic would be of interest to the site's users. Note that, in spite of overlap reduction, multiple clusters with the same topic are found. Korg and Roland are instrument manufacturers; keyboards, sequencers, and samplers are types of instruments.

3. Create the graph corresponding to the matrix, and find maximal cliques (or connected components) in the graph. We create a graph in which each page is a node and each nonzero cell in the matrix is an arc. Next, we apply graph algorithms that efficiently extract connectivity information from the graph (e.g., the linear-time algorithm for identifying connected components). The frequency information on the arcs is ignored in this step for the sake of efficiency. By creating a sparse graph and using efficient graph algorithms for cluster mining, we can identify high-quality clusters substantially faster than by relying on traditional clustering methods [15]. We may extract two kinds of subgraphs, leading to two variants of the PageGather algorithm: PGCLIQUE, which finds maximal cliques in the graph, and PGCC, which finds its connected components.
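A sketch of the connected-component extraction used by PGCC; the clique variant is noted in the comments:

    def connected_components(graph):
        # Linear-time depth-first traversal over the adjacency sets
        # built above.  PGCLIQUE would instead enumerate maximal
        # cliques (e.g., with the Bron-Kerbosch algorithm), which is
        # practical here only because the thresholded graph is sparse.
        seen, components = set(), []
        for start in graph:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], []
            while stack:
                node = stack.pop()
                component.append(node)
                for neighbor in graph[node]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
            components.append(component)
        return components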

In step 2, we applied a threshold to the similarity matrix. When this threshold is high, the graph will be sparse, and we will find few clusters, which will tend to be small and of high quality. When the threshold is lower, we will find more, larger clusters. Note that PGCLIQUE and PGCC will generally require different threshold settings; a sparse graph that contains a number of connected components may be too sparse to contain any sizeable cliques. We tune the threshold experimentally to find a sufficient number of clusters of approximately the desired size; even so, PGCLIQUE generally finds a small number of small clusters. Generally, we find that PGCLIQUE performs better than PGCC, and so it is the variant we use. We compare these variants experimentally in section 3.

4. Rank the clusters found, and choose which to output. Step (3) may find many clusters, but we may wish to output only a few. For example, the site's webmaster might want to see no more than a handful of clusters every week and decide which to turn into new index pages. Accordingly, each cluster found is rated by averaging the co-occurrence frequency over all pairs of documents in the cluster, and the clusters are sorted by this rating (a sketch of the rating step follows below). We find that PGCLIQUE tends to discover many similar clusters. Therefore, we have developed two ways of reducing the number of similar clusters in the final results: reduction, which discards clusters that overlap heavily with higher-ranked ones, and merging, which combines heavily overlapping clusters into one.
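The rating-and-sorting step can be sketched as follows, assuming every cluster contains at least two pages:

    import itertools

    def rank_clusters(clusters, similarity):
        # Rate a cluster by the mean co-occurrence frequency over all
        # pairs of its documents, then sort best-first.
        def rating(cluster):
            pairs = list(itertools.combinations(sorted(cluster), 2))
            return sum(similarity.get(p, 0.0) for p in pairs) / len(pairs)
        return sorted(clusters, key=rating, reverse=True)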

Both reduction and merging require an overlap measure; we use the size of the intersection between two clusters divided by the size of their union. In both variants, the overlap threshold is a parameter that has been tuned experimentally. Note that, as connected components never overlap, neither reduction nor merging has any effect on PGCC. Reduction and merging each have advantages and disadvantages. In general, reduction preserves the coherence of the clusters found (as it makes no changes to cluster contents) but may miss pages that could be clustered together. Merging, on the other hand, may combine all related pages, but at the cost of reducing the overall coherence of the clusters. By default, we use reduction in order to preserve the coherence of our clusters. These two variants are compared to each other -- as well as to the ``raw'' ranked list -- in section 3.
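One plausible realization of the overlap measure and the reduction variant; the greedy walk over the ranked list is an assumption for illustration, and merging would instead union heavily overlapping clusters:

    def overlap(c1, c2):
        # Size of the intersection divided by the size of the union.
        c1, c2 = set(c1), set(c2)
        return len(c1 & c2) / len(c1 | c2)

    def reduce_clusters(ranked, max_overlap):
        # The reduction variant: walk the ranked list, keeping a
        # cluster only if it does not overlap any already-kept cluster
        # by more than the experimentally tuned threshold.
        kept = []
        for cluster in ranked:
            if all(overlap(cluster, k) <= max_overlap for k in kept):
                kept.append(cluster)
        return kept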

However we process the ranked list of clusters, we return a bounded number of results and apply a quality threshold, sometimes returning no results at all. The webmaster specifies this bound -- e.g., ``give me at most ten clusters above quality threshold X...''.

5. For each cluster found, create a web page consisting of links to the documents in the cluster, and present it to the webmaster for evaluation. The PageGather algorithm finds candidate link sets and presents them to the webmaster, as in figure 2. The webmaster is prompted to accept or reject the cluster, to name it, and to remove any links she thinks inappropriate. Links are labeled with the titles of the target pages and ordered alphabetically by those titles. The webmaster is responsible for placing the new page at the site.
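A sketch of how a candidate page might be rendered; the titles mapping from URL to page title is an assumed helper:

    def candidate_page(cluster, titles):
        # Links are labeled with the titles of the target pages and
        # ordered alphabetically by those titles, as described above.
        ordered = sorted(cluster, key=lambda url: titles[url])
        links = "\n".join('<li><a href="%s">%s</a></li>' % (url, titles[url])
                          for url in ordered)
        return "<html><body><ul>\n%s\n</ul></body></html>" % links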

