WWW2006, Edinburgh, UK 8 2006
2
Probabilistic Models for Discovering ECommunities
tabularp35 em
tabularp10em
The Pennsylvania State University, IST Building
University Park, PA 16802
tabularp10em
The Pennsylvania State University, Thomas Building
University Park, PA 16802
tabularp10em
The Pennsylvania State University, IST Building
University Park, PA 16802
Date: 30 July 1999
Abstract:
H.4Information Systems ApplicationsMiscellaneous G.3Probability and StatisticsModels J.4Social and Behavioral Sciences
Algorithms, Experimentation, Human Factors
1 Introduction
Social network analysis is an established field in sociology [23]. The increasing availability of social network data has led to more computational research in social network analysis (SNA), e.g., discovering interpersonal relationships based on various modes of information carriers, such as emails [22,27], message boards [10] and the Web [3]. Analysis of social networks (SNs) can be applied to many domains including viral marketing [4,17] and the evaluation of importance of social actors [24].
0.4emailSociety.eps

An important characteristic of all SNs is the community graph structure: how social actors gather into groups such that they are intragroup close and intergroup loose [13]. An illustration of a simple twocommunity SN is sketched in Fig. 1. Here each node represents a social actor in the SN and different node shapes represent different communities. Two nodes share an edge if and only if a relationship exists between them according to social definitions such as their role or participation in the social network. Connections in this case are binary.
Discovering community structures from general networks is of obvious interest. Early methods include graph partitioning [8] and hierarchical clustering [19,25]. Recent algorithms [6,14,2] addressed several problems related to prior knowledge of community size, the precise definition of intervertices similarity measure and improved computational efficiency [13]. They have been applied successfully to various areas such as email networks [22] and the Web [5]. Semantic similarity between Web pages can be measured using humangenerated topical directories [9]. In general, semantic similarity in SNs is the meaning or reason behind the network connections.
For the extraction of community structures from email corpora [22,3], the social network is usually constructed measuring the intensity of contacts between email users. In this setting, every email user is a social actor, modeled as a node in the SN. An edge between two nodes indicates that the existing email communication between them is higher than certain frequency threshold.
Consider other ways a community can be established, e.g. Fig. 2. From the preset communication intensity, person and person belong to two different communities, denoted by squares and circles, based on a simple graph partitioning. However, ignoring the document semantics in their communications, their common interests (denoted by the dashed line) are not considered in traditional community discovery.
In this paper, we examine the inner community property within SNs by analyzing the semantically rich information, such as emails or documents. We approach the problem of community detection using a generative Bayesian network that models the generation of communication in an SN. As suggested in established social science theory [23], we consider the formation of communities as resulting from the similarity among social actors. The generative models we propose introduce such similarity as a hidden layer in the probabilistic model.
2 Related Work and Contributions
In the first part of this section, we introduce the related work on document modeling. Three related generative models based on which our models are built are described: TopicWord model, AuthorWord model and AuthorTopic model. In the second part of this section, we model the generation of SN communications, enabling us to propose a solution to the problem of semantic community identification.
[TopicWord (LDA)]
0.4topicModel.eps
[AuthorWord]
0.4authorModel.eps
[AuthorTopic]
0.4authortopicModel.eps

1 Related Work
Related work on document content characterization [1,7,11,21] introduces a set of probabilistic models to simulate the generation of a document. Several factors in producing a document, either observable (e.g. author [11]) or latent (e.g. topic [7,1]), are modeled as variables in the generative Bayesian network and have been shown to work well for document content characterization.Given a set of documents , each consisting of a sequence of words of size , the generation of each word for a specific document can be modeled from the perspective of either author or topic, or the combination of both. Fig. 3 illustrates the three possibilities using plate notations. Let denote a specific word observed in document ; and represent the number of topics and authors; is the observed set of authors for . Note that the latent variables are lightcolored while the observed ones are shadowed. Fig. 3(a) models documents as generated by a mixture of topics [1]. The prior distributions of topics and words follow Dirichlets parameterized respectively by and . Each topic is a probabilistic multinomial distribution over words. Let denote the topic's distributions over words while the document's distribution over topics. ^{1}
Similar to the TopicWord model, an AuthorWord model prioritizes the author interest as the origin of a word [11]. In Fig. 3(b), is the author set that composes document . Each word in this is chosen from the authorspecific distribution over words. Note that in this AuthorWord model, the author responsible for a certain word is chosen at random from .
An influential work following this model [21] introduces the AuthorTopic model combined with the TopicWord and AuthorWord models and regards the generation of a document as affected by both factors in a hierarchical manner. Fig. 3(c) presents the hierarchical Bayesian structure.
2 Contributions
We address the problem of identifying social actors based on the semantics of the communications, in particular the semantics as they are related to their communication documents.Much communication in SNs usually occurs by exchanging documents, such as emails, instant messages or posts on message boards [15]. Such content rich documents naturally serve as an indicator of the innate semantics in the communication among an SN. Consider an information scenario where all communications rely on email. Such email documents usually reflect nearly every aspect of and reasons for this communication. For example, the recipient list records the social actors that are associated with this email and the message body stores the topics they are interested in.
Modeling communication based on communication document takes into consideration the semantic information of the document as well as the interactions among social actors. Many features of the SN can be revealed from the parameterized models such as the leaderfollower relation [12]. Using such models, we can avoid the effect of meaningless communication documents, such as those generated by a network spammer, in producing communities.
3 CommunityUserTopic Models
Our definition for a semantic community in a social network is:
We study the community structure of an SN by modeling the communication documents among its social actors and the format of communication documents we model is email because emails embody valuable information regarding shared knowledge and the SN infrastructure [22].
1 CUT: Modeling Community with Users
Given the impact of community in the generation of communication, the first step is to determine the interrelationships among this latent variable, the email users and the topics, i.e. the structure of the Bayesian network.We first consider an SN community as no more than a group of users. This is a notion similar to that assumed in a topologybased method. For a specific topologybased graph partitioning algorithm such as [13], the connection between two users can be simply weighted by the frequency of their communications. In our first model CUT, we treat each community as a multinomial distribution over users. Each user is associated with a conditional probability which measures the degree that belongs to community . The goal is therefore to find out the conditional probability of a user given each community. Then users can be tagged with a set of topics, each of which is a distribution over words. A community discovered by CUT is typically in the structure as shown in Fig. 8.
Typically, an email message is generated by four steps: (1) there is a need for a community to issue an act of communication by sending an email ; (2) a user is chosen from as observed in the recipient list in ; (3) presents to read since a topic is concerned, which is drawn from the conditional probability on over topics; (4) given topic , a word is created in . By iterating the same procedure, an email message is composed word by word.
Note that the is not necessarily the composer of the message in our models. This differs from existing literatures which assume as the author of document. The assumption is that a user is concerned with any word in a communication document as long as the user is on the recipient list.
To compute , the posterior probability of assigning each word to a certain community , user and topic , consider the joint distribution of all variables in the model:
(1) 
Theoretically, the conditional probability can be computed using the joint distribution .
A possible sideeffect of CUT, which considers a community solely as a multinomial distribution over users, is it relaxes the community's impact on the generated topics. Intrinsically, a community forms because its users communicate frequently and in addition they share common topics in discussions as well. In CUT where community only generates users and the topics are generated conditioned on users, the relaxation is propagated, leading to a loose connection between community and topic. We will see in the experiments that the communities discovered by CUT is similar to the topologybased algorithm proposed in [13].
2 CUT: Modeling Community with Topics
In contrast to CUT, our second model introduces the notion that an SN community consists of a set of topics, which are of concern to respective user groups.As illustrated in Fig. 5, each word observed in email is finally chosen from the multinomial distribution of a user , which is from the recipient list of . Before that, is sampled from another multinomial of topic and is drawn from community 's distribution over topics.
Analogously, the products of CUT are a set of conditional probability that determines which of the topics are most likely to be discussed in community . Given a topic group that associates for each topic , the users who refer to can be discovered by measuring .
CUT differs from CUT in strengthing the relation between community and topic. In CUT, semantics play a more important role in the discovery of communities. Similar to CUT, the sideeffect of advancing topic in the generative process might lead to loose ties between community and users. An obvious phenomena of using CUT is that some users are grouped to the same community when they share common topics even if they correspond rarely, leading to the different scenarios for which the CUT models are most appropriate. For CUT, users often tend to be grouped to the same communities while CUT accentuates the topic similarities between users even if their communication seem less frequent.
0.4cut2.eps

Derived from Fig. 5, define in CUT the joint distribution of community , user , topic and word :
(2) 
Let us see how these models can be used to discover the communities that consist of users and topics. Consider the conditional probability , a word associates three variables: community, user and topic. Our interpretation of the semantic meaning of is the probability that word is generated by user under topic , in community .
Unfortunately, this conditional probability can not be computed directly. To get ,we have:
(3) 
Consider the denominator in Eq. 3, summing over all , and makes the computation impractical in terms of efficiency. In addition, as shown in [7], the summing doesn't factorize, which makes the manipulation of denominator difficult. In the following section, we will show how an approximate approach of Gibbs sampling will provide solutions to such problems. A faster algorithm EnFGibbs sampling will also be introduced.
4 Semantic Community Discovery: The Algorithms
In this section, we first introduce the Gibbs sampling algorithm. Then we address the problem of semantic community discovery by adapting Gibbs sampling framework to our models. Finally, we combine two powerful ideas: Gibbs sampling and entropy filtering to improve efficiency and performance, yielding a new algorithm: EnFGibbs sampling.
1 Gibbs Sampling
Gibbs sampling is an algorithm to approximate the joint distribution of multiple variables by drawing a sequence of samples. As a special case of the MetropolisHastings algorithm [18], Gibbs sampling is a Markov chain Monte Carlo algorithm and usually applies when the conditional probability distribution of each variable can be evaluated. Rather than explicitly parameterizing the distributions for variables, Gibbs sampling integrates out the parameters and estimates the corresponding posterior probability.Gibbs sampling was first introduced to estimate the TopicWord model in [7]. In Gibbs sampling, a Markov chain is formed, the transition between successive states of which is simulated by repeatedly drawing a topic for each observed word from its conditional probability on all other variables. In the AuthorTopic model, the algorithm goes over all documents word by word. For each word , the topic and the author responsible for this word are assigned based on the posterior probability conditioned on all other variables: . and denote the topic and author assigned to , while and are all other assignments of topic and author excluding current instance. represents other observed words in the document set and is the observed author set for this document.
(4)  
(5)  
(6) 
where and , and are prior parameters for word and topic Dirichlets, represents the number of times that word is assigned to topic , represents the number of times that author is assigned to topic .
The transformation from Eq. 4 to Eq. 5 drops the variables, , , , , because each instance of is assumed independent of the other words in a message.
2 Semantic Community Discovery
By applying the Gibbs sampling, we can discover the semantic communities by using the CUT models. Consider the conditional probability , where three variables in the model, community, user^{4} and topic, are associated by a word . The semantic meaning of is the probability that belongs to user under topic , in community . By estimation of , we can label a community with semantic tags (topics) in addition to the affiliated users. The problem of semantic community discovery is thus reduced to the estimation of .
The framework of Gibbs sampling is illustrated in Fig. 6. Given the set of users , set of email documents , the number of desired topic , number of desired community are input, the algorithm starts with randomly assigning words to a community, user and topic. A Markov chain is constructed to converge to the target distribution. In each trial of this Monte Carlo simulation, a block of is assigned to the observed word . After a number of states in the chain, the joint distribution approximates the targeted distribution.
To adapt Gibbs sampling for CUT models, the key step is estimation of . For the two CUT models, we describe the estimation methods respectively.
Let be the probability that is generated by community , user on topic , which is conditioned on all the assignments of words excluding the current observation of . , and represent all the assignments of topic, user and word not including current assignment of word .
In CUT, combining Eq. 1 and Eq. 3, assuming uniform prior probabilities on community , we can compute for CUT by:
(7) 
where , and are estimated via:
(8)  
(9)  
(10) 
In the equations above, is the number of times that word is assigned to topic , not including the current instance. is the number of times that topic is associated with user and is the number of times that user belongs to community , both not including the current instance. is the number of communities in the social network given as an argument.
The computation for Eq. 8 requires keeping a matrix , each entry of which records the number of times that word is assigned to topic . Similarly, a matrix and a matrix are needed for computation in Eq. 9 and Eq. 10.
Similarly, is estimated based on the Bayesian structure in CUT:
(11) 
Hence the computation of CUT demands the storage of three 2D matrices: , and .
With the set of matrices obtained after successive states in the Markov chain, the semantic communities can be discovered and tagged with semantic labels. For example, in CUT, the users belonging to each community can be discovered by maximizing in . Then the topics that these users concern are similarly obtained from and explanation for each topic can be retrieved from .
3 Gibbs Sampling with Entropy Filtering
In this section, we further develop Gibbs sampling to improve computational efficiency and performance.Consider two problems with Gibbs sampling illustrated in Fig. 6: (1) efficiency: Gibbs sampling has been known to suffer from high computational complexity. Given a textual corpus with words. Let there be users, communities and topics. An iteration Gibbs sampling has the worst time complexity , which in this case is about computations. (2) performance: unless performed explicitly before Gibbs sampling, the algorithm may yield poor performance by including many nondescriptive words. For Gibbs sampling, some common words like 'the', 'you', 'and' must be cleaned before Gibbs sampling. However, the EnFGibbs sampling saves such overhead by automatically removing the noninformative words based on entropy measure.
Fig. 7 illustrates the EnFGibbs sampling algorithm we propose. We incorporate the idea of entropy filtering into Gibbs sampling. During the interactions of EnFGibbs sampling, the algorithm keeps in an index of words that are not informative. After times of iterations, we start to ignore the words that are either already in the or are noninformative. In Step 15 of Fig. 7, we quantify the informativeness of a word by the entropy of this word over another variable. For example, in CUT where keeps the numbers of times is assigned to all topics, we calculate the entropy on the row of the matrix.
5 Experiments
We present experimental results of our models with the Enron email corpus. Enron email dataset was made public by the Federal Energy Regulatory Commission during its investigations and subsequently made available [20]In this section, we present examples of discovered semantic communities. Then we compare our communities with those discovered by the topologybased algorithm [2] by comparing groupings of users. Finally we evaluate the computational complexity of Gibbs sampling and EnFGibbs sampling for our models.
1 Semantic Community Representation
We preprocessed the Enron email dataset by removing the common stop words. Each employee in Enron is identified by an email address. For brevity, we use only the email ids without organization suffixes hereafter.
In all of our simulations, we fixed the number of communities at 6 and the number of topics at 20. The smoothing hyperparameters , and were set at , 0.01 and 0.1 respectively. We ran 1000 iterations for both our Gibbs sampling and EnFGibbs sampling with the MySQL database support. Because the quality of results produced by Gibbs sampling and our EnFGibbs sampling are very close, we simply present the results of EnFGibbs sampling hereafter.
0.6comOne.eps

The ontologies for both models are illustrated in Fig. 8 and Fig. 11. In both figures, we denote user, topic and community by square, hexagon and dot respectively. In CUT results, a community connects a group of users and each user is associated with a set of topics. A probability threshold can be set to tune the number of users and topics desired for description of a community. In Fig. 8, we present all the users and two topics of one user for a discovered community. By union all the topics for the desired users of a community, we can tag a community with topic labels.
Topic 3  Topic 5  Topic 12  Topic 14 

rate  dynegy  budget  contract 
cash  gas  plan  monitor 
balance  transmission  chart  litigation 
number  energy  deal  agreement 
price  transco  project  trade 
analysis  calpx  report  cpuc 
database  power  group  pressure 
deals  california  meeting  utility 
letter  reliant  draft  materials 
fax  electric  discussion  citizen 
Table 1: Topics Discovered by CUT
Fig. 8 shows that user mike.grigsby is one of the users in community 3. Two of the topics that is mostly concerned with mike.grigsby are topic 5 and topic 12. Table 1 shows explanations for some of the topics discovered for this community. We obtain the word description for a topic by choosing 10 from the top 20 words that maximize . We only choose 10 words out of 20 because there exist some names with large conditional probability on a topic that for privacy concern we do not want to disclose.
abbreviations  organizations 

dynegy  An electricity, natural gas provider 
transco  A gas transportation company 
calpx  California Power Exchange Corp. 
cpuc  California Public Utilities Commission 
ferc  Federal Energy Regulatory Commission 
epsa  Electric Power Supply Association 
naruc  National Association of 
Regulatory Utility Commissioners  
Table 2: Abbreviations
We can see from Table 1 that words with similar semantics are nicely grouped to the same topics. For better understanding of some abbreviate names popularly used in Enron emails, we list the abbreviations with corresponding complete names in Table 2.
[Over communities]
0.35userCom.eps
[Over topics]
0.35harry.arora.eps

For a single user, Fig. 9 illustrates its probability distribution over communities and topics as learned from the CUT model. We can see the multinomial distribution we assumed was nicely discovered in both figures. The distribution over topics for all users are presented in Fig. 10. From Fig. 10, we can see some Enron employees are highly active to be involved in certain topics while some are relatively inactive, varying in heights of peaks over users.
0.4tu.eps

Fig. 11 illustrates a community discovered by CUT. According to the figure, Topic 8 belongs to the semantic community and this topic concerns a set of users, which includes rick.buy whose frequently used words are more or less related to business and risk. Surprisingly enough, we found the words our CUT learned to describe such users were very appropriate after we checked the original positions of these employees in Enron. For the four users presented in Table 3, d..steffes was the vice president of Enron in charge of government affairs; cara.semperger was a senior analyst; mike.grigsby was a marketing manager and rick.buy was the chief risk management officer.
0.65comTwo.eps

d..steffes  cara.s  mike.grigsby  rick.buy 

power  number  file  corp 
transmission  cash  trader  loss 
epsa  ferc  report  risk 
ferc  database  price  activity 
generator  peak  customer  validation 
government  deal  meeting  off 
california  bilat  market  business 
cpuc  caps  sources  possible 
electric  points  position  increase 
naruc  analysis  project  natural 
Table 3: Distribution over words of some users
2 Semantic Community Discovery Quality
We evaluate the quality of discovered communities against the topologybased algorithm in [2], a hierarchical agglomeration algorithm for community structure detection. The algorithm is based on , which is a measurement of whether a division of a network is a good one, in the sense that there are many edges within communities and only a few between them. We employ the clustering comparison method in [16] to measure the similarity between our communities and the clusters of users produced by [2].Given data objects, the similarity between two clustering results is defined^{5}:
where denotes the count of object pairs that are in different clusters for both clustering and is the count of pair that are in the same cluster.
0.25[height=5in, width=10in]sim.eps

The similarities between three CUT models and are illustrated in Fig. 12. We can see that as we expected the similarity between CUT and is large while that between CUT and is small. This is because the CUT is more similar to than CUT by defining a community as no more than a group of users.
We also test the similarity among topics(users) for the users(topics) which are discovered as a community by CUT (CUT). Typically the topics(users) associated with the users(topics) in a community represent high similarities. For example, in Fig. 8, Topic 5 and Topic 12 that concern mike.grigsby are both contained in the topic set of lindy.donoho, who is the community companion of mike.grigsby.
3 Computational Complexity and EnFGibbs Sampling
[Time vs. sample size]
0.4cut1time.eps
[Time vs. iterations]
0.4itertime.eps

We evaluate the computational complexity of Gibbs sampling and EnFGibbs sampling for our models. For the two metrics we measure the computational complexity based on are total running time and iterationwise running time. For overall running time we sampled different scales of subsets of messages from Enron email corpus. For the iterationwise evaluation, we ran both Gibbs sampling and EnFGibbs sampling on complete dataset.
In Fig. 13(a), the running time of both sampling algorithms on two models are illustrated. We can see that generally learning CUT is more efficient than CUT. It is a reasonable result considering the matrices for CUT are larger in scales than CUT. Also entropy filtering in Gibbs sampling leads to 4 to 5 times speedup overall.
The stepwise running time comparison between Gibbs sampling and EnFGibbs sampling is shown in Fig. 13(b). We perform the entropy filtering removal after 8 iterations in the Markov chain. We can see the EnFGibbs sampling well outperforms Gibbs sampling in efficiency. Our experimental results also show that the quality of EnFGibbs sampling and Gibbs sampling are almost the same.
6 Conclusions AND FUTURE WORK
We present two versions of CommunityUserTopic models for semantic community discovery in social networks. Our models combine the generative probabilistic modeling with community detection. To simulate the generative models, we introduce EnFGibbs sampling which extends Gibbs sampling based on entropy filtering. Experiments have shown that our method effectively tags communities with topic semantics with better efficiency than Gibbs sampling.
0.4cut3.eps

Future work would consider the possible expansion of our CUT models as illustrated in Fig. 14. The two CUT models we proposed either emphasize the relation between community and users or between community and topics. It would be interesting to see how the community structure changes when both factors are simultaneously considered. One would expect new communities to emerge. The model in Fig. 14 constrains the community as a joint distribution over topic and users. However, such nonlinear generative models require larger computational resources, asking for more efficient yet approximate solutions. It would also be interesting to explore the predictive performance of these models on new communications between strange social actors in SNs.
Bibliography

 1

D. Blei, et al.,
Latent Dirichlet allocation,
Journal of Machine Learning Research, 3, 9931022, 2003.
 2

Aaron Clauset, et al.,
Finding community structure in very large networks,
Phys. Rev. E 70, 066111, 2004.
 3

Aron Culotta, et al.,
Extracting social networks and contact information from email and the Web,
In First Conference on Email and AntiSpam, Mountain View, CA, USA. July 2005.
 4

P. Domingos, et al.,
Mining the network value of customers,
In 7th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, 5766, ACM Press, 2001.
 5

G. Flake, S. Lawrence and C.L. Giles,
Efficient Identification of Web Communities,
In 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 150160, 2000.
 6

M. Girvan and M. Newman,
Community structure in social and biological networks.
In Proceedings National Academic Science, USA 99, 78217826, 2002.
 7

T. Griffiths,
Finding scientific topics,
In National Academy of Sciences, 52285235, 2004.
 8

B. W. Kernighan,
An efficient heuristic procedure for partitioning graphs,
Bell System Technical Journal, 49, 291307, 1970.
 9

Ana Maguitman, et al.,
Algorithmic detection of semantic similarity,
In Proceedings of the 14th international conference on World Wide Web (WWW 2005)). Chiba, Japan, May. 2005.
 10

Naohiro Matsumura, et al.,
Mining Social Networks in Message Boards,
In Proceedings of the 14th International Conference on World Wide Web (WWW 2005), Chiba, Japan, May. 2005.
 11

A. McCallum,
Multilabel text classification with a mixture model trained by EM,
In AAAI Workshop on Text Learning, 1999.
 12

A. McCallum, et al.,
The AuthorRecipientTopic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email,
Technical Report, Computer Science, University of Massachusetts Amherst, 2004.
 13

Mark Newman,
Fast algorithm for detecting community structure in networks,
Phys. Rev., E, 2004.
 14

Mark Newman,
Detecting community structure in networks,
Eur. Phys. J. B 38, 321330, 2004.
 15

Mike Perkowitz, et al.,
Mining models of human activities from the Web,
In Proceedings of the 13th International Conference on World Wide Web (WWW 2004), New York, NY, USA, 2004.
 16

W.M Rand,
Objective criteria for the evaluation of clustering methods,
Journal of American Statistical Association, 66:846850, 1971.
 17

M. Richardson, et al.,
Mining knowledgesharing sites for viral marketing,
In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data mining,
6170, ACM Press, 2002.
 18

Christian P. Robert and George Casella,
Monte Carlo Statistical Methods, Springer Publisher.
 19

J. Scott, Social Network Analysis: A Handbook, Sage, London, 2nd edition, 2000.
 20

Jitesh Shetty, et al.,
The Enron Email Dataset Database Schema and Brief Statistical Report, Information Sciences Institute, 2004.
 21

Mark Steyvers, et al.,
Probabilistic authortopic models for information discovery,
In 10th ACM SIGKDD international conference on Knowledge Discovery and Data mining, 306315, Seattle,WA, 2004.
 22

Joshua R. Tyler, et al.,
Email as Spectroscopy: Automated Discovery of Community Structure within Organizations,
Communities and technologies,8196,2003.
 23

Stanley Wasserman and Katherine Faust,
Social Network Analysis: Methods and Applications,
Cambridge University Press, 1994.
 24

S. White, et al.,
Algorithms for estimating relative importance in networks,
In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data mining,
266275, 2003.
 25

Andrew Y. Wu, et al., Mining scalefree networks using geodesic clustering, In Proceedings of the 10th SIGKDD international conference on Knowledge discovery and data mining, 719724, Seattle, Washington, USA, 2004.
 26

Ding Zhou, et al., A New Mallows distance based Metric for Comparing Clusterings, In Proceedings of the 22rd International Conference on Machine Learning (ICML 2005), 8pp, Bonn, Germany, 2005.
 27

Ding Zhou, et al., Towards Discovering Organizational Structure from Email Corpus, In Proceedings of the 4th IEEE International Conference on Machine Learning and Applications, 8pp, Los Angeles , CA, U.S.A. 2005.
Footnotes
 ... topics.^{1}
 Usually the is represented using matrix, where and are the number of topics and size of vocabulary. Similarly is modeled as matrix.
 ....^{2}
 The TopicWord model was firstly introduced with name Latent Dirichlet Allocation (LDA). In consistency with the line of research, we use the alternative name.
 ... model^{3}
 In order to fit our model literally to the social network built on email communication, we change the name "Author" to "User". An alternative name of our model is CommunityAuthorTopic Model: CAT.
 ... user^{4}
 Note we denote user with in our models instead of as in previous work.
 ... defined^{5}
 Another recent work on comparing clusterings is defined introduced in [26]. But for our problem where cluster labels are categorical, both clustering comparison perform similarly as suggested.