A Matrix Density Based Algorithm to Hierarchically
Co-Cluster Documents and Words
Bhushan Mandhani^{1}, Sachindra Joshi^{2},
Krishna Kummamuru^{2}
1: mandhani@ee.iitb.ac.in, Indian Institute of Technology, Bombay, India
2: {jsachind, kkummamu}@in.ibm.com, IBM India Research Lab, New Delhi, India
Copyright is held by the author/owner(s).
WWW2003, May 20-24, 2003, Budapest, Hungary.
ACM 1-58113-680-3/03/0005.
Abstract:
We propose a partitional-agglomerative algorithm to co-cluster the documents and words of a document collection. The partitioning step involves the identification of dense submatrices of the document-word matrix, such that the respective row sets partition the row set of the complete matrix. The hierarchical agglomerative step involves merging the most ``similar'' submatrices until we are down to the required number of clusters (if we want a flat clustering) or until we have just the single complete matrix left (if we are interested in a hierarchical arrangement of documents). The algorithm also generates apt labels for each cluster or hierarchy node. The similarity measure between clusters that we use for the merging exploits the fact that the clusters here are co-clusters, and is a key point of difference from existing agglomerative algorithms. We will refer to the proposed algorithm as RPSA (Row-set Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with spherical k-means and spectral graph partitioning. We have also evaluated some hierarchies generated by the algorithm.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - Clustering
General Terms
Algorithms, Performance, Experimentation
1 Introduction
The Web has grown from a few hundred thousand pages in the early 1990s to over 2 billion pages in 2002. IDC reports that over 200 million pages are added every month, while at the same time 100 million pages become obsolete. A similar trend has been observed in intranets too. With this explosion of unstructured information, it has become increasingly important to organize the information in a comprehensible and navigable manner. A hierarchical arrangement of documents is very useful in browsing a document collection, as the popularity of the Yahoo, Google and Dmoz web directories has shown. The Dmoz directory has been manually created by about 52 thousand editors. Though manually generated directories are more comprehensible and accurate, they are not always feasible. Therefore, there is a need for tools to automatically arrange documents into labelled hierarchies.
Clustering of documents is an important part of this process. The nodes at each level of the hierarchy can be viewed as a clustering of the documents, with the number of clusters increasing as one goes deeper in the hierarchy. The popular classes of document clustering algorithms are k-means and its variants, hierarchical agglomerative clustering (HAC) methods [1], and more recently, graph partitioning methods [2,3]. For k-means, the best performing similarity measure between documents is the cosine measure between two document vectors [4,5]. HAC algorithms start with singleton documents as clusters, and iteratively merge the two most similar clusters. They differ in their choice of similarity measure between clusters [6,7]. There also exist other methods tailored for clustering search results [8,9]. We will look more closely at the clustering literature in Section 5.
The model commonly used for documents is the bag-of-words or vector space model. A set of words is chosen from the set of all words in all documents using some criterion. Each document is a vector in the feature space formed by these words. The vector entries can be word frequencies or some other measure. For any moderately sized document set, the number of words runs into a few thousand. Thus document vectors are high-dimensional. Putting these vectors together, we get the document-word matrix, which is extremely sparse.
Our proposed algorithm RPSA is a two-step partitional-agglomerative algorithm. The partitioning step involves the creation of ``dense'' submatrices (we define ``dense'' later) so that the respective row sets partition the set of rows of the complete matrix, i.e. partition the document set. We will call these submatrices ``leaf clusters'', since they will form the leaf nodes of the hierarchy subsequently created in the agglomerative step. Note that leaf clusters, being submatrices, are document-word co-clusters. The number of these leaf clusters will typically be much larger than the number of clusters desired. Note that though a document can belong to only one leaf cluster, a word can occur in several of them. For the creation of individual leaf clusters, we use an algorithm which, given a document, creates a leaf cluster containing that document. Thus the partitioning step can be viewed as leader clustering. The second step is hierarchical agglomerative, in which the most similar clusters are successively merged. Our algorithm can be used both for flat clustering and for hierarchical clustering (it naturally creates hierarchies even in the former case).
The organisation of this paper is as follows. In the next section, we describe how the document vectors are created, and how we will be evaluating the quality of the clustering, for both the flat and the hierarchical cases. In Section 3, we describe the proposed RPSA algorithm, from the creation of the leaf clusters to obtaining the final labelled hierarchical clustering. In Section 4, we present our experimental results. For the flat clustering case, we compare our results with those given by bipartite spectral graph partitioning (see [2]) and spherical k-means. We also compare the labels generated by our algorithm with those generated by the former. For hierarchical clustering, we have used a couple of data sets downloaded from the dmoz directory. In Section 5, we look at related work, and how RPSA compares with some of the existing algorithms for clustering documents. Finally, we present our conclusions in Section 6.
2 Background
In this section, we first describe how feature selection is done, and how document vectors are created. We then describe the measures used to evaluate the quality of a flat clustering, and finally we talk about how we are going to evaluate the hierarchies created.
2.1 Vector Model for Documents
In the preprocessing, a ``dictionary'' of all the words occurring in the document set is created. Stopwords are then removed. For each word, a count of the number of documents in which it occurs is kept. We then remove words occurring in less than 0.2% of the documents, i.e. the low-frequency words, and those occurring in more than 20% of the documents, i.e. the high-frequency words. The remaining words constitute our feature space. They make up the columns of the document-word matrix.
2.1.1 Creation of Document Vectors
Suppose the feature space consists of n words, and we have d documents. Document i is represented by m_{i}, a vector in the space R^{n}. The component of this vector corresponding to the jth word is the (unitized-TF)-IDF weight:

m_{ij} = (t_{ij} / || t_{i} ||) log(d / d_{j})

where t_{ij} is the frequency of word j in document i, t_{i} is the TF vector of document i, and d_{j} is the number of documents in which word j occurs.
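For concreteness, the weighting just described can be sketched as follows. This is our own illustrative code, not the paper's implementation; it assumes ``unitized'' means scaling each TF vector to unit L2 norm, and uses a dictionary-of-dicts sparse representation.

```python
import math

def build_matrix(tf_rows):
    """tf_rows: list of dicts mapping word index -> raw term frequency.
    Returns the rows of the document-word matrix M as dicts,
    weighted as (unitized-TF) * IDF."""
    d = len(tf_rows)
    # document frequency d_j: number of documents containing word j
    df = {}
    for row in tf_rows:
        for j in row:
            df[j] = df.get(j, 0) + 1
    m_rows = []
    for row in tf_rows:
        norm = math.sqrt(sum(v * v for v in row.values()))  # unitize the TF vector
        m_rows.append({j: (v / norm) * math.log(d / df[j])
                       for j, v in row.items()})
    return m_rows
```

Note that a word occurring in every document gets IDF log(d/d) = 0 and thus contributes nothing, which is consistent with the removal of high-frequency words above.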
2.2 Evaluation of Clustering Quality
The following two measures are commonly used to judge cluster quality, given an external partitioning of the documents of the data set into labeled classes. Suppose we have g classes with labels {1, 2,..., g} and k clusters {X_{1}, X_{2},..., X_{k}}. We denote by n_{j}^{i} the number of members of class i in cluster X_{j}, and by n_{j} the number of elements making up X_{j}.
- Purity: It is defined to be:

P = (1/d) Σ_{j} max_{i} n_{j}^{i}

where i varies over all class labels, and d is, as before, the number of documents. Purity is thus simply the classification accuracy assuming all members of a cluster were predicted to be members of the dominant class in that cluster.
- Entropy: The entropy of cluster X_{j} is defined as:

E_{j} = - Σ_{i} p_{ij} log p_{ij}

where p_{ij} is the fraction of members of X_{j} that come from class i, i.e. p_{ij} = n_{j}^{i}/n_{j}. The entropy of the total clustering is defined as:

E = Σ_{j} (n_{j}/d) E_{j}
Though entropy, unlike purity, takes into account the entire confusion matrix, i.e. all the numbers n_{j}^{i}, we believe purity to be the better measure. We explain this with an example. Suppose we represent cluster X_{j} as a vector in R^{g} with the ith component being p_{ij}. Let g = 4, and suppose we have one cluster given by (0.5, 0.5, 0, 0) and another given by (0.8, 0.1, 0.05, 0.05). The second cluster can easily be identified with the first class, is clearly the better cluster, and has the higher purity value. However, the entropy of the first cluster is 0.693 while that of the second is 0.708. Entropy identifies the first cluster as the better one! We will be using purity to evaluate our results.
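The two measures, and the entropy values in the example above, can be computed with a minimal sketch (function names are ours):

```python
import math

def purity(confusion):
    """confusion[j][i] = n_j^i: members of class i falling in cluster j."""
    d = sum(sum(row) for row in confusion)
    return sum(max(row) for row in confusion) / d

def entropy(p):
    """Entropy of one cluster from its class-fraction vector (p_ij over i)."""
    return -sum(x * math.log(x) for x in p if x > 0)

print(round(entropy([0.5, 0.5, 0, 0]), 3))        # 0.693
print(round(entropy([0.8, 0.1, 0.05, 0.05]), 3))  # 0.708
```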
2.3 Evaluation of Hierarchical Clustering
While there exist several ways to evaluate the quality of clusters in a flat clustering, doing the same for hierarchical clustering is difficult and less researched in the literature [10]. To compare the hierarchies generated by our algorithm with the original ones, we use the following criteria:
- The difference in the number of nodes between the two at each level.
- The purity at a particular level of the generated hierarchy, computed by taking the nodes of the corresponding level of the original hierarchy as the external labeled classes. Purity at a level is thus a measure of the clustering quality achieved at that hierarchy level.
- A comparison of the generated node labels with those in the original hierarchy.
In the original hierarchies, all documents belong to some leaf node; documents are not allowed to originate at intermediate nodes.
3 Row-set Partitioning and Submatrix Agglomeration (RPSA)
3.1 The Partitioning Step
The partitioning step of RPSA, as mentioned earlier, creates dense submatrices whose row sets partition the row set of the complete matrix. We will refer to these submatrices as leaf clusters since they are the leaf nodes of the hierarchy subsequently generated. In this section, we will first look at an algorithm that, given a document and a density parameter (explained later), creates a leaf cluster containing that document. We will then see how successive input documents to this algorithm are chosen, and the considerations involved in choosing the density parameter. Finally, we will put these all together to get the complete partitioning step.
We define the density of a matrix of reals as the average value of the entries of the matrix. Given our document-word matrix M, a set R of rows and a set C of columns define a submatrix S of M. We will use d(S) to denote the density of S. The density of a row r over a column set C is defined as the average of the entries of r lying in the columns making up C, i.e.

d(r, C) = (1/|C|) Σ_{c in C} M_{rc}
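In code, both densities are simple averages over index sets. The sketch below (our own function names) stores M as a dense list of lists for clarity; the actual implementation described in section 3.1.5 uses a sparse representation.

```python
def density(M, R, C):
    """d(S): average of the entries of the submatrix given by row set R
    and column set C."""
    return sum(M[r][c] for r in R for c in C) / (len(R) * len(C))

def row_density(M, r, C):
    """Density of row r over column set C."""
    return sum(M[r][c] for c in C) / len(C)
```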
3.1.1 Generating a Leaf Cluster
We will now consider an algorithm that, given a row i and a value α > 1, outputs a submatrix S of M containing the row i and having density approximately α d(M). This algorithm is an extension of the ping-pong algorithm proposed by [11], which does the same job for a binary matrix. It is described by the following steps:
1. Initialize the row set R = {i} and the column set C to null. Set the row threshold t_{r} and the column threshold t_{c} equal to α d(M).
2. Add to C the columns c whose density over R is greater than or equal to t_{c}.
3. Similarly, add to R the rows r whose density over C is greater than or equal to t_{r}.
4. If d(S) ≥ α d(M):
- if no new columns or rows were added in steps 2, 3 above, set t_{r} = 0.9 t_{r};
- go to step 2.
Otherwise, stop and output S.
Thus the above algorithm, starting with a single row, alternately adds columns and rows, stopping after every cycle of addition to check that d(S) is above the threshold. When it can't add new rows or columns, it selectively reduces the row threshold, keeping the column threshold constant. Often the output of this algorithm contains only a few rows, so we selectively reduce the row threshold to encourage the output to have a larger number of rows. In [11], the row and column selection criteria are designed for a {0,1}-matrix and not for a matrix of reals. Further, [11] makes no differentiation between rows and columns.
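Steps 1-4 can be sketched as follows. This is our own illustrative code (dense list-of-lists M, `alpha` for the density parameter); since the stopping behaviour on degenerate inputs is not spelled out in the text, we add a small safety cap on the shrinking row threshold as an assumption of ours.

```python
def leaf_cluster(M, i, alpha):
    """Grow a dense submatrix (R, C) of M containing row i,
    with density approximately alpha * d(M)."""
    n_rows, n_cols = len(M), len(M[0])
    d_M = sum(map(sum, M)) / (n_rows * n_cols)   # overall density d(M)
    target = alpha * d_M
    R, C = {i}, set()                            # step 1
    t_r = t_c = target
    while True:
        # step 2: add columns whose density over R meets the column threshold
        new_c = {c for c in range(n_cols) if c not in C
                 and sum(M[r][c] for r in R) / len(R) >= t_c}
        C |= new_c
        # step 3: add rows whose density over C meets the row threshold
        new_r = {r for r in range(n_rows) if r not in R and C
                 and sum(M[r][c] for c in C) / len(C) >= t_r}
        R |= new_r
        # step 4: continue only while d(S) stays at or above alpha * d(M)
        dens = (sum(M[r][c] for r in R for c in C) / (len(R) * len(C))
                if C else target)
        if dens < target:
            break
        if not new_c and not new_r:
            t_r *= 0.9                           # stuck: relax the row threshold only
            if t_r < 1e-12:
                break                            # safety cap (our addition)
    return R, C
```

On a toy matrix with one dense block, the algorithm first recovers the block and then, as the row threshold relaxes, admits further rows until the density drops below the target.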
3.1.2 Choice of Leader Documents
The first input to the above algorithm is a row number (document). This document can be viewed as the ``leader'' for the subsequently generated leaf cluster. We now see how leader documents are successively chosen to form leaf clusters.
We define the length l(i) of a document i as the sum of the components of the TF-IDF vector representing that document. Note that this vector is different from the one representing the document in M, where we have used (unitized-TF)-IDF.
Documents with relatively large lengths were observed to be better leader documents for the algorithm above. They generally created large and meaningful document clusters, with the cardinality of the associated word clusters typically being 2 to 5 times that of the corresponding document clusters. Documents with the smallest lengths often created skewed clusters having very few rows, and a large number of columns. Such leaf clusters introduce errors in the agglomerative step, and are undesirable.
Suppose at some stage m leaf clusters have been formed. The jth leaf cluster is the submatrix S_{j}, defined by the row set R_{j} and the column set C_{j}. The leader for the (m + 1)th cluster has to be chosen from the set L of documents not occurring in any of the m clusters formed so far. For a document i in L, we define its membership M_{ij} in cluster j as the density of row i over C_{j}. It is a measure of the similarity of i with the documents occurring in cluster j. Thus the cumulative membership of i in the clusters formed so far is:

M_{i} = Σ_{j=1}^{m} M_{ij}
M_{i} is a measure of the overall similarity of i with the existing clusters. As leader, we want a document i in L such that l(i) is relatively large and M_{i} is as small as possible. Thus, from among the top one-third of the documents of L by length, we choose the document i for which M_{i} is minimum. Thus we have in place the algorithm to choose leader documents.
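Under the same conventions as before (dense M, our own function names), the leader selection might look like:

```python
def next_leader(M, uncovered, cluster_cols, lengths):
    """Pick the next leader document: among the top one-third of uncovered
    documents by length l(i), the one with minimum cumulative membership M_i.
    cluster_cols: list of column sets C_j of the leaf clusters formed so far."""
    def cumulative_membership(i):
        # M_i = sum over clusters j of the density of row i over C_j
        return sum(sum(M[i][c] for c in C) / len(C) for C in cluster_cols)
    by_length = sorted(uncovered, key=lambda i: lengths[i], reverse=True)
    top_third = by_length[:max(1, len(by_length) // 3)]
    return min(top_third, key=cumulative_membership)
```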
3.1.3 The Density Parameter
The second input to the leaf-cluster creating algorithm is the density parameter α. The density of the generated leaf cluster S is approximately α d(M). Clearly, the larger the value of α, the smaller the size of S, and vice versa. The typical sizes of the generated leaf clusters for a given α were found to be a function of the total number of documents alone, and independent of the document set involved. This is not surprising, since in creating the matrix M we first unitized the TF vector, cancelling the effect of varying document sizes. This means that α does not need to be tweaked for different data sets. We experimented with values of α varying from 10 to 50. For the larger values, though the leaf clusters formed have high purity, they are small, and thus the number of leaf clusters is large. This not only slows down the algorithm but, more importantly, causes errors in the merging step, with the final clusters being unbalanced in size and the larger ones having fairly low purity. Low values of α result in slightly lower purity of the leaf clusters, but these leaf clusters are larger and fewer in number. This results in a faster running time and, vitally, no errors in the merging. In all the results shown in this paper, we have used α = 20.
3.1.4 The Complete Partitioning Algorithm
The complete partitioning algorithm is as follows:
1. Set the variable leader equal to the document with the largest length. Initialize the set L of uncovered documents to the entire document set. Set the number of leaf clusters formed, num, to 0.
2. Increment num. Form the leaf cluster S_{num} using the document leader. Remove from L the documents covered by this new cluster.
3. If the number of covered documents is less than a threshold fraction of the total document set, find the next leader using the ``choose leader'' algorithm of section 3.1.2 and go back to step 2. Otherwise proceed.
4. Suppose we now have m leaf clusters. Each document i in L is then put into the leaf cluster S_{j} for which the density of i over the corresponding column set C_{j} is maximum, for j varying from 1 to m.
In our experiments, we used 80% for this threshold. The reason for going up to only 80% of the documents in creating the leaf clusters is that towards the end we start getting clusters with only a few documents (we mentioned earlier that this is mostly the case with leader documents of small lengths), and it becomes more efficient to place the remaining documents into the existing clusters, as we do in step 4. Thus, we now have in place a partitioning of the document set into leaf clusters.
3.1.5 Complexity Analysis
We now discuss the time and space complexity of the partitioning algorithm described above, in terms of the number of nonzero entries z in the d×n document-word matrix M, and the total number of leaf clusters created, m. Though m is O(d) in the worst case, our results later show that m increases very slowly, and not linearly, with d. In fact, as d varies from 90 to 3893, m varies only from 32 to 117. So we will use m as a parameter in our analysis. Additionally, we use an efficient representation for the sparse matrix M and store only its z nonzero elements.
In step 2, a leaf cluster is created using the algorithm of section 3.1.1. When creating a leaf cluster, we maintain a list containing, for each prospective row, the sum of its values lying in the current column set of the leaf cluster. To choose the rows to be added, we traverse this list, updating the sum for each prospective row by adding to it its values lying in the columns that were added to the cluster in the preceding step. This sum is then divided by the total number of columns in the cluster to get the row density, and if this exceeds the row threshold, the row is added to the cluster. Excluding for the time being the time taken for updating the row sums, the operation is constant-time for each prospective row. Thus the time for adding rows in a single step is O(d). Each column is added to the leaf cluster at most once. When a column is added, the row sums of those prospective rows whose entry in this column is nonzero have to be updated. Each such entry requires one addition. Thus, across all column additions, the total time for updating row sums is O(z). Similarly, the time for adding columns in a single step is O(n), and the total time for updating column sums is O(z). If we have c addition cycles for creating some leaf cluster, the time taken for it becomes O(z + cd + cn). The maximum value of c was found to be 31 for dmoz300 (a data set of 300 dmoz pages) and 36 for dmoz1800 (1800 dmoz pages). We can choose to bound c with a constant, without affecting the performance of the algorithm. Thus, the time for creating a leaf cluster becomes O(z). The time for creating m such leaf clusters is O(mz).
In step 3, a new leader document is chosen using the method of section 3.1.2. We maintain a list containing the cumulative membership values for each uncovered document. When a new leaf cluster is created, the membership of each uncovered document in this new cluster needs to be computed to update this list. The uncovered documents and the column set of the newly formed leaf cluster make up a submatrix of M. Each nonzero entry of this submatrix results in one addition, giving O(z) additions. The number of divisions and comparisons is O(d). Thus the time for updating this list, and identifying the next leader, is O(z). Moreover, step 3 is repeated (m - 1) times. Thus the total time for choosing leader documents is O(mz). Finally, in step 4, we need to compute, within each of the m leaf clusters, the membership of each uncovered document. We again get, for each leaf cluster, a submatrix every nonzero entry of which requires one addition, and every row one division. For m leaf clusters, the total time is again O(mz).
Thus, the time complexity of the complete partitioning algorithm is O(mz). For each leaf cluster, we only need to store the corresponding row numbers and column numbers. The row sets are disjoint, and we don't expect a column to occur in too many leaf clusters. The total space required for the leaf clusters is thus O(d + n). The space needed for the matrix M is O(z). Thus the space complexity is O(z).
3.2 The Agglomerative Step
The partitioning step of RPSA gives a fairly large number of leaf clusters, each a dense submatrix. This clustering is usually fine-grained. The agglomerative step reduces the number of clusters and makes the clustering coarser. Note that this step is usually required even if we want a flat clustering. We will first describe this step for flat clustering, and then consider the case when the goal is a hierarchical clustering.
3.2.1 Flat Clustering
We successively merge the most similar clusters in a hierarchical agglomerative step, until we are down to the required number of clusters. The similarity measure between two clusters that we use for the merging exploits the fact that the clusters are co-clusters, and thus submatrices, and is key to the working of the algorithm.
Recall that the ith cluster is a submatrix S_{i} defined by the row set R_{i} and the column set C_{i}. For clusters i, j, the density of the submatrix formed by the row set R_{i} and the column set C_{j} is a measure of how much we can expect to find the words of cluster j in the documents making up cluster i. A similar comment can be made about the submatrix formed by the combination of R_{j} and C_{i}. We define the similarity between clusters i, j as the average of all the elements making up these two submatrices.
When we merge clusters i, j, the row set of the obtained cluster is the union of the row sets R_{i} and R_{j}, and the column set is, similarly, the union of the corresponding column sets. Thus, using a larger set of small dense submatrices covering all the rows of M, we create a smaller set of larger, sparser submatrices covering all the rows of M. This gives us the final clustering.
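The similarity and merge operations above can be sketched directly (our own function names; M dense as in the earlier sketches):

```python
def cocluster_similarity(M, Ri, Ci, Rj, Cj):
    """Average over the two 'cross' submatrices Ri x Cj and Rj x Ci."""
    total = (sum(M[r][c] for r in Ri for c in Cj)
             + sum(M[r][c] for r in Rj for c in Ci))
    return total / (len(Ri) * len(Cj) + len(Rj) * len(Ci))

def merge_coclusters(Ri, Ci, Rj, Cj):
    """The merged cluster takes the union of the row sets and of the column sets."""
    return Ri | Rj, Ci | Cj
```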
3.2.2 Hierarchical Clustering
The merging process for hierarchy generation is the same as above. The difference is that when we start out, the leaf clusters are all assigned a height 0. Subsequently,
- When two nodes (clusters) having the same height are merged, a new node is created and assigned a height one greater than theirs. This new node has the two merged nodes as children.
- When two nodes with different heights are merged, the lower node is added to the children of the higher node, and the higher node is replaced by the merged node (no new node is created).
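The two height rules can be expressed with a small node structure (a sketch; the class and field names are ours):

```python
class Node:
    def __init__(self, cluster, height=0, children=None):
        self.cluster = cluster          # the (row set, column set) co-cluster
        self.height = height
        self.children = children if children is not None else []

def merge_nodes(a, b, merged_cluster):
    if a.height == b.height:
        # equal heights: create a new parent one level up
        return Node(merged_cluster, a.height + 1, [a, b])
    hi, lo = (a, b) if a.height > b.height else (b, a)
    # unequal heights: the lower node joins the higher node's children,
    # and the higher node becomes the merged node (no new node created)
    hi.children.append(lo)
    hi.cluster = merged_cluster
    return hi
```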
3.2.3 Generating Labels for Clusters
Labels for cluster i come from the column set C_{i}. We define the degree of description of cluster i for a word w in C_{i} as the density of column w over the corresponding row set R_{i}. We expect that the greater this value, the better the description that w provides of the cluster. Thus we can sort the words making up C_{i} in descending order of their degrees of description. In the results in the next section, the top 7 words have been shown for the clustering experiments, while for hierarchical clustering, the top word for a node has been used as its label.
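The label ranking reduces to sorting the cluster's words by column density over its row set (a sketch with our own names):

```python
def cluster_labels(M, R, C, top=7):
    """Words of column set C sorted by degree of description:
    the density of column w over row set R."""
    def degree(w):
        return sum(M[r][w] for r in R) / len(R)
    return sorted(C, key=degree, reverse=True)[:top]
```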
3.2.4 Complexity Analysis
This hierarchical agglomerative step begins with m leaf clusters, and has complexity quadratic in m. Computing the similarity between two clusters requires taking the average of the elements of two disjoint submatrices of M, and thus takes time O(z). Thus the total time for the agglomerative step is O(m^{2}z). Finally, for each hierarchy node, we store the numbers of the leaf clusters lying in the subtree rooted at the node, and pointers to the children of the node. There are m leaf clusters and O(m) nodes. Thus the extra space required for the agglomerative step is O(m^{2}).
4 Experimental Results
In this section we present the results of our proposed RPSA algorithm for both flat and hierarchical clustering. For the former case, we compare the results obtained with those given by the bipartite spectral graph partitioning algorithm and spherical k-means.
Dhillon [2] represents the document-word matrix as a bipartite graph with edges running between the set of documents and the set of words. The problem of obtaining k co-clusters is posed as partitioning the vertices of the graph into k sets so that the sum of the weights of edges joining vertices in different sets is minimized. Spherical k-means (see [5]) uses the cosine similarity measure between document vectors. TF-IDF vectors are unitized to get the document vectors; further, the centroid vectors too are unitized. A random number generator was used to choose the documents that serve as the initial centroids. Since the results produced by k-means depend on the choice of the initial centroids, for every data set considered we have run the algorithm 10 times, and show the average purity as well as the value corresponding to the best run.
4.1 Flat Clustering
4.1.1 Data Sets
We have used the following three data sets for our experiments:
- Classic3: It contains 1400 aerospace systems abstracts from the Cranfield collection, 1033 medical abstracts from the Medline collection and 1460 information retrieval abstracts from the Cisi collection, making up 3893 documents in all. It can be downloaded from ftp://ftp.cs.cornell.edu/pub/smart.
- Yahoo_K1: It contains 2340 Reuters news articles downloaded from Yahoo in 1997. The articles are from 6 categories: 494 from Health, 1389 from Entertainment, 141 from Sports, 114 from Politics, 60 from Technology and 142 from Business. The data set is available at ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata/docK.
- NewsGroups: It contains 250 documents each from the following 10 Usenet newsgroups, making up 2500 documents in all: alt.atheism, comp.graphics, ibm.pc.hardware, rec.autos, rec.sport.baseball, sci.crypt, sci.electronics, sci.med, talk.mideast.politics and sci.space.
In addition, we have used a subset of Classic3 made up of 30 documents from each of its 3 sources. We will refer to this 90 document set as 90docs. The idea is to see how the algorithms behave for small data sets. The results of Spectral Graph Partitioning have been taken from [2], and thus are available only for Classic3 and Yahoo_K1.

4.1.2 Experimental Results
Below we show the confusion matrices for RPSA and Graph Partitioning for the Classic3 data set. Note that we can consider Classic3 a ``clean'' data set, in the sense that its sources deal with disparate subjects (aerospace, medicine and IR), and the ideal clustering (100% purity) has balanced cluster sizes. All three algorithms perform well, with k-means giving the best results with an average purity of 99.13% (see Table 6), while RPSA gives 98.41%. The labels generated by the two co-clustering algorithms are similar, and give a good description of the clusters.


Next, we show the confusion matrices for the Yahoo_K1 data set. Yahoo_K1 is a much noisier data set. Moreover, nearly 1900 of the 2340 documents come from just 2 of the 6 categories, namely Health and Entertainment, so the desired clustering is skewed.
As Table 6 shows, RPSA gives the best results, and is on a par with the best of the 10 k-means runs with a purity of over 85%. Graph Partitioning gives 79% purity, and is outperformed by the average k-means run. In fact, as the confusion matrices below show, its performance is actually even worse than the purity comparison suggests. For RPSA, 5 of the 6 clusters formed are good clusters, i.e. they are made up of one dominant category, with negligible or no contributions from the other five. Two of these extract the entire Health and Sports categories respectively. Graph Partitioning, on the other hand, gives only 3 good clusters out of 6, and none of these extracts a complete category. Finally, the quality of its labels too is inferior to that of RPSA, for which not a single label is out of place.


In the NewsGroups data set, there is noise between categories like comp.graphics and pc.hardware, and the number of categories (ten) is relatively large. This is reflected in the low purity values. RPSA outperforms the average k-means run, but is outperformed by the best k-means run.
Finally, k-means is known to perform poorly on small data sets, and this is verified by the results obtained with 90docs. RPSA easily outperforms the best k-means run. This poor k-means performance occurs because it settles down at an inferior local optimum after 2-3 iterations.
4.2 Hierarchical Clustering
RPSA naturally generates a hierarchy while generating flat clusters. In this subsection, we evaluate the performance of RPSA in creating hierarchies. At different levels of the hierarchy, we are interested in evaluating the purity, the number of nodes, and the labels generated for the nodes. Note that for calculating the purity, the nodes of the corresponding level in the original hierarchy are taken as the external class labels.
4.2.1 Data Sets
We have used two data sets downloaded from the dmoz web directory.
- dmoz1800: It is a two-level hierarchy with 3 nodes at level one, each of which has 3 children, giving 9 nodes at level two. Each of the level-two nodes has 200 documents, making up 1800 documents in all.
  Arts(600)
    Visual Arts(200)
    Music(200)
    Performing Arts(200)
  Business(600)
    Management(200)
    Accounting(200)
    Venture Capital(200)
  Health(600)
    Alternative(200)
    Fitness(200)
    Medicine(200)
- dmoz300: This is again a two-level hierarchy, with 2 nodes at level one and 6 at level two. This is a small 300-document data set.
  Arts(150)
    Movies(50)
    Music(50)
    Performing Arts(50)
  Business(150)
    Management(50)
    Accounting(50)
    International Business & Trade(50)
4.2.2 Experimental Results

The hierarchy generated for dmoz1800 is as follows. The numbers in parentheses are the numbers of documents at each node.
Art(719)
  Juggling(171)
  Jazz(61)
  Dance(210)
  Artists(277)
Ventures(502)
  Management(115)
  Tax(174)
  Capital(161)
  Leadership(52)
Health(399)
  Herbal(75)
  Healing(117)
  Care(207)
Training(180)
  Exercise(54)
  Chartered(50)
  Partners(76)

For dmoz1800, as can be seen from the confusion matrix, nodes ``Art'' and ``Ventures'' respectively extract the ``Arts'' and ``Business'' categories of the original hierarchy. However, the ``Health'' node of the original hierarchy splits into two distinct nodes, ``Health'' and ``Training'', and also has a component in the ``Art'' node. This additional Level 1 node introduces further additional nodes at Level 2, giving 14 Level 2 nodes in all. The Level 1 purity is a bit low. However, the Level 2 purity of over 60% is quite acceptable, considering that the number of class labels is 9, and that we can expect noise between subcategories of the same Level 1 node, for example, between Music and Performing Arts. Finally, the generated taxonomy looks good. Most node labels can be easily mapped to nodes in the original hierarchy. Mapped Level 2 nodes have mapped parents.
The hierarchy generated for dmoz300 is:
Dance(140)
  Film(32)
  Hip(23)
  Folk(85)
Tax(160)
  Performance(25)
  CPA(63)
  Trade(21)
  Quality(44)
  Strategy(7)
The dmoz300 results show that performance does not degrade for small data sets, as it probably would for a k-means based hierarchical clustering scheme. One important point to note is that the number of leaf clusters formed is comparable for the two data sets, despite one being much larger than the other. Similarly, in the previous subsection, though the number of documents varied from 90 to 3893, the number of leaf clusters formed varied only from 32 to 117. Thus, as mentioned earlier, the leaf cluster sizes scale up with the number of documents, and the total number m of leaf clusters grows quite slowly with the data set size. As our analysis of the running time of RPSA showed, this is important for running RPSA on larger data sets.
5 Related Work
The k-means algorithm is an iterative partitioning method. In each iteration, documents are assigned to the cluster with the nearest centroid. Convergence occurs when there is no further movement of points. [4,5] describe the use of k-means for document clustering. They use the cosine measure between document vectors as the document similarity, and call the resulting algorithm spherical k-means. The k-means algorithm is a gradient descent scheme, which seeks to minimize the sum of the distances of data points from the centroid of the cluster they belong to. Though it generally performs quite well, it can get stuck at inferior local minima. This is especially the case with small data sets. Its running time is O(zk), where z is the number of nonzero entries in the matrix M (see [4]) and k is the number of desired clusters. We earlier saw the running time of the partitioning step of RPSA to be O(zm). However, k-means runs much faster than the partitioning step of RPSA. This is because the number of times the z entries of M are considered is larger for RPSA; moreover, the number of leaf clusters m is much larger than k. Being a partitioning algorithm, k-means does not create hierarchies.
In hierarchical agglomerative clustering (HAC) methods, performance is determined by the choice of similarity measure between clusters. These include:
- Centroid Similarity: the (cosine) similarity between the cluster centroids is the similarity between the clusters.
- Intra-Cluster Similarity: the ``coherence'' of a clustering is the sum of the similarities of each data point with the centroid of the cluster it belongs to. We merge the two clusters for which the resulting decrease in coherence is smallest.
- UPGMA: the similarity between two clusters is the average similarity over all document pairs drawn from the cross product of the two clusters.
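Assuming unit-length document vectors stacked as rows of NumPy arrays, the three measures could be sketched as below (function names are illustrative):

```python
import numpy as np

def centroid_sim(A, B):
    """Centroid similarity: cosine between the two cluster centroids."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

def upgma_sim(A, B):
    """UPGMA: average pairwise cosine over the cross product of the
    two clusters (rows are assumed to be unit-length vectors)."""
    return (A @ B.T).mean()

def coherence(clusters):
    """Intra-cluster similarity: sum over all clusters of each point's
    cosine with its cluster centroid. Under this measure, one merges
    the pair of clusters whose merge decreases this total the least."""
    total = 0.0
    for C in clusters:
        c = C.mean(axis=0)
        total += (C @ (c / np.linalg.norm(c))).sum()
    return total
```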
The only algorithm known to the authors that is similar in spirit to the partitioning step of RPSA is the Clustering By Committee (CBC) algorithm proposed in [12]. It has two distinct phases. In the first phase, tight clusters called committees are extracted. The algorithm attempts to form as many committees as it can, subject to the condition that the similarity between any two is less than a threshold. Cosine similarity is used, and the similarity between two committees is the similarity between their centroids. To create these committees, for each uncovered document, the algorithm considers the top-k similar documents for some fixed k, and extracts from them an optimal cluster. This cluster becomes a committee if its similarity with each of the existing committees is less than the threshold. When this has been done, each uncovered document whose maximum similarity with a committee centroid exceeds a second threshold is assigned to the corresponding cluster. This process is repeated until either no more committees can be formed or no uncovered documents remain. The leaf clusters created in step 2 of the partitioning step of RPSA (see 3.1.4) are loosely analogous to the committees of CBC. Our choice of the document having minimum similarity with the existing leaf clusters as the next leader in step 3 is analogous to the condition that each new committee's similarity with each of the existing committees be below the threshold. Finally, in the second phase of CBC, each uncovered document is assigned to the cluster with whose committee it has greatest similarity. In step 4 of our partitioning step, we similarly assign each uncovered document to the leaf cluster over whose column set the document has greatest density.
[2] proposes an algorithm based on spectral graph partitioning. This algorithm, like the partitioning step of RPSA, is a co-clustering algorithm and does not use or define any notion of similarity between individual documents. It models a document collection as a weighted bipartite graph in which each edge runs from a document vertex to a word vertex, with the edge weight being the corresponding entry in matrix M. The cut between vertex sets U and V of the graph is the sum of the weights of edges joining a vertex in U to a vertex in V. The problem of splitting the vertex set into two clusters is posed as that of minimizing the cut between the two vertex subsets while keeping their sizes balanced. The second eigenvector of the Laplacian matrix of the graph gives a real-valued approximation of the desired partition vector. When the number of desired clusters k is greater than 2, log_2 k eigenvectors need to be computed.
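A simplified sketch of the two-cluster case follows. It builds the unnormalized Laplacian L = D - W of the bipartite graph and splits vertices by the sign of the eigenvector of the second-smallest eigenvalue; note that [2] actually solves a generalized (normalized) eigenproblem to keep the cut balanced, so this is an illustration of the idea rather than that algorithm.

```python
import numpy as np

def bipartite_spectral_bipartition(M):
    """Sketch of two-way document-word co-clustering via the bipartite
    graph Laplacian. Vertices are the d documents (rows of M) followed
    by the w words (columns of M); edge weights come from M."""
    d, w = M.shape
    n = d + w
    # adjacency matrix of the bipartite graph: only doc-word edges exist
    W = np.zeros((n, n))
    W[:d, d:] = M
    W[d:, :d] = M.T
    D = np.diag(W.sum(axis=1))       # diagonal degree matrix
    L = D - W                        # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order; take the eigenvector
    # of the second-smallest eigenvalue (the Fiedler vector)
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # sign of the Fiedler entries gives the two-way vertex partition,
    # yielding a document cluster and a word cluster on each side
    return fiedler[:d] >= 0, fiedler[d:] >= 0
```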
Grouper [8] is an online clustering interface for web search engine results. Its clustering algorithm is suffix tree clustering (STC) [13], which models a document as a string of words rather than as a vector. Other algorithms include [10], a probabilistic hierarchical clustering algorithm whose objective function is based on marginal likelihood.
6 Conclusions
In this paper, we have proposed an algorithm that interprets a dense submatrix of the document-word matrix as a document-word co-cluster. In the first step, we partition the row set into a relatively large number of small, dense submatrices or leaf clusters. We then merge these to get a smaller number of larger, sparser submatrices. We have also defined a criterion for choosing labels from the word cluster of a co-cluster.
As a flat clustering algorithm, RPSA performs comparably to Spectral Graph Partitioning on the Classic3 data set, and outperforms it on the noisier Yahoo_K1 data set with respect to both cluster purity and label quality. In fact, in all but one case, it is comparable with or better than the best k-means run. Moreover, its performance does not degrade on small data sets. As a hierarchical clustering algorithm, RPSA does a good job of hierarchy creation and node labeling. As we go down a hierarchy, the number of nodes shoots up, and it becomes difficult to obtain high purity values; RPSA nevertheless gives acceptable purity values at Level 2 of the two dmoz data sets.
The main application of RPSA would be in browsing large, unknown document collections. Another would be in organising search results. Finally, the fact that ours is a co-clustering algorithm can be exploited for feature selection in unsupervised learning: every document belongs to a document cluster and has an associated word cluster, which can be used as its set of features.
Bibliography
[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey, 1989.
[2] Inderjit S. Dhillon, ``Co-clustering documents and words using bipartite spectral graph partitioning,'' in Proc. of the Seventh ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001, pp. 269-274.
[3] H. Zha, Chris Ding, and Ming Gu, ``Bipartite graph partitioning and data clustering,'' in Proc. of the 10th ACM International Conference on Information and Knowledge Management (CIKM), Atlanta, 2001, pp. 25-31.
[4] Inderjit S. Dhillon, James Fan, and Yuqiang Guan, ``Efficient clustering of very large document collections,'' in Data Mining for Scientific and Engineering Applications, R. Grossman, G. Kamath, and R. Naburu, Eds., Kluwer Academic Publishers, 2001.
[5] Inderjit S. Dhillon and Dharmendra S. Modha, ``Concept decompositions for large sparse text data using clustering,'' Tech. Rep. RJ 10147, IBM Almaden Research Center, August 1999.
[6] Michael Steinbach, George Karypis, and Vipin Kumar, ``A comparison of document clustering techniques,'' in KDD Workshop on Text Mining, 2000.
[7] P. Willett, ``Recent trends in hierarchical document clustering: A critical review,'' Information Processing and Management, vol. 24, pp. 577-597, 1988.
[8] Oren Zamir and Oren Etzioni, ``Grouper: A dynamic clustering interface to Web search results,'' Computer Networks, vol. 31, no. 11-16, pp. 1361-1374, 1999.
[9] Marti A. Hearst and Jan O. Pedersen, ``Reexamining the cluster hypothesis: Scatter/Gather on retrieval results,'' in Proc. of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, Zürich, 1996, pp. 76-84.
[10] Shivakumar Vaithyanathan and Byron Dom, ``Hierarchical unsupervised learning,'' in Proc. of the Seventeenth International Conference on Machine Learning (ICML-2000), July 2000.
[11] Shigeru Oyanagi, Kazuto Kubota, and Akihiko Nakase, ``Application of matrix clustering to web log analysis and access prediction,'' in WEBKDD, August 2001.
[12] Patrick Pantel and Dekang Lin, ``Document clustering with committees,'' in Proc. of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 199-206.
[13] Oren Zamir and Oren Etzioni, ``Web document clustering: A feasibility demonstration,'' in Proc. of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46-54.