Next: Experimental Validation Up: A Case Study: Index Previous: Time Complexity
PageGather addresses the problem of finding sets of items based on their access patterns. Previous work in both clustering and data mining can also be applied to this problem. As discussed above, cluster mining is a variation on traditional clustering. PageGather uses a graph-based clustering component which is specialized to cluster mining, but it is possible to adapt traditional clustering algorithms to this problem. In  and follow-up work, we compared PageGather's clustering component (both PGCLIQUE and PGCC variants) to two standard algorithms: K-means clustering  and hierarchical agglomerative clustering (HAC). There are literally hundreds of clustering algorithms and variations thereof. We chose K-means as it is particularly fast, and HAC as it is widely used. We found that PageGather's clustering component was faster than both HAC and K-means and found higher-quality clusters.
Frequent set algorithms are designed to find sets of similar items in large collections (see, for example, [1,2,19,21]). We therefore compare PageGather to the standard Apriori algorithm for finding frequent sets (see ). In a traditional frequent set problem, the data is a collection of market basket information. Each basket contains a set of items, and the goal is to find sets of items that appear together in many baskets. In our problem domain, a user visit corresponds to a market basket, and the set of pages visited by the user corresponds to the set of items in the basket. In section 3, we compare the performance of PageGather to that of Apriori on our test data.
Next: Experimental Validation Up: A Case Study: Index Previous: Time Complexity Mike Perkowitz