E-mail has become the primary means of communication in many organizations. It is a rich source of information that could be used to improve the functioning of an organization. Hence, the search and analysis of e-mail messages have drawn significant interest from the research community [5,2].
Specifically, e-mail messages can serve as a source for "expertise identification" [1], since they capture people's activities, interests, and goals in a natural way.
While early approaches to expert finding (i.e., identifying experts on a given topic) employed manually maintained databases, there has been a move towards unsupervised methods that use expertise indicators from documents produced within an organization; the resulting evidence of expertise is then used to build an employee's expertise profile.
Our main aim in this paper is to study the use of e-mail messages for mining expertise information. Our main findings are that (i) the fielded structure of e-mail messages can be effectively exploited to find pieces of evidence of expertise, which can then be successfully combined in a language modeling framework, and (ii) e-mail signatures are a reliable source of personal contact information.
The rest of the paper is structured as follows. In Section 2 we detail and assess our model of expert search. In Section 3 we harvest contact details for candidates by mining e-mail signatures. We conclude in Section 4.
2 Finding Experts
We model the expert finding task as follows: what is the probability of a candidate ca being an expert given the query topic q? Instead of computing this probability p(ca|q) directly, we can use Bayes' Theorem to rank candidates in proportion to p(q|ca), the probability of the query given the candidate. Below, we first detail our model, and then evaluate its effectiveness.
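The ranking step can be written out as follows. This is a sketch under one assumption that the text does not state explicitly: the candidate prior p(ca) is taken to be uniform, so it does not affect the ranking. The term p(q) is constant for a given query and can be dropped in any case.

```latex
p(ca \mid q) \;=\; \frac{p(q \mid ca)\, p(ca)}{p(q)}
\;\propto\; p(q \mid ca)\, p(ca)
\;\propto\; p(q \mid ca)
\qquad \text{(assuming a uniform } p(ca)\text{)}
```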
2.1 Modeling Expert Search
We first find documents (i.e., e-mail messages) that are relevant to the query topic and then score each candidate by aggregating over all documents associated with that candidate. That is, $p(q \mid ca) \propto \sum_{d} p(q \mid d)\, p(ca \mid d)$. To determine $p(q \mid d)$, the probability of a query given a document, we use a standard language modeling approach to IR. To estimate the strength of the association between document $d$ and candidate $ca$, $p(ca \mid d)$, we assume that an association score $a(d,ca)$ has been calculated for each document $d$ and for each candidate $ca$. To turn these associations into probabilities, we put $p(ca \mid d) = a(d,ca) / \sum_{d_i \in D} a(d_i,ca)$, where $D$ is a set of e-mail messages.
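The aggregation described above can be sketched in a few lines of Python. This is an illustration only, not the implementation used in the experiments. The text says only that p(q|d) comes from "a standard language modeling" approach, so the Jelinek-Mercer smoothing and its parameter `lam` are assumptions. The function names and the dictionary-based data layout are hypothetical as well.

```python
from collections import Counter, defaultdict


def query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
    """p(q|d) under a unigram model; Jelinek-Mercer smoothing is an assumption."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    prob = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        prob *= (1 - lam) * p_doc + lam * p_coll
    return prob


def rank_candidates(query_terms, docs, assoc, lam=0.5):
    """Score candidates by sum_d p(q|d) * p(ca|d).

    docs:  {doc_id: [term, ...]}            (hypothetical layout)
    assoc: {(doc_id, candidate): a(d, ca)}  (association scores)
    """
    coll_counts = Counter(t for terms in docs.values() for t in terms)
    coll_len = sum(coll_counts.values())
    p_q_d = {
        d: query_likelihood(query_terms, terms, coll_counts, coll_len, lam)
        for d, terms in docs.items()
    }

    # Normaliser from the text: sum of a(d_i, ca) over all documents d_i in D.
    totals = defaultdict(float)
    for (_, ca), a in assoc.items():
        totals[ca] += a

    scores = defaultdict(float)
    for (d, ca), a in assoc.items():
        if a > 0:
            scores[ca] += p_q_d[d] * a / totals[ca]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {"d1": ["xml", "schema", "xml"], "d2": ["css", "layout"]}
    assoc = {("d1", "alice"): 1.0, ("d2", "bob"): 1.0}
    print(rank_candidates(["xml"], docs, assoc))
```

Candidates whose association scores are all zero never receive a score and are therefore left out of the ranking.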
To compute the associations a(d,ca) we exploit the fact that our documents are e-mail messages. A list of candidate experts is created by extracting names and e-mail addresses from message headers. We introduce four binary association methods for deciding whether a document d and candidate ca are associated:
- [A0] EMAIL_FROM returns 1 if the candidate appears in the from field of the e-mail
- [A1] EMAIL_TO returns 1 if the candidate appears in the to field of the e-mail
- [A2] EMAIL_CC returns 1 if the candidate appears in the cc field of the e-mail
- [A3] EMAIL_CONTENT returns 1 if the candidate's name appears in the content of the e-mail message. The first and last names are obligatory; middle names are not.
Since A0-A3 are likely to capture different aspects of the relation between a document and a candidate expert, we also consider (linear) combinations of their outcomes. Hence, we put $a(d,ca) := \sum_{i=0}^{3} \pi_i A_i(d,ca)$, where the $\pi_i$ are weights.
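The four binary association methods and their weighted combination might look as follows in Python. This is a minimal sketch under stated assumptions: the `Email` and `Candidate` classes are hypothetical containers, header matching is done on lowercased e-mail addresses, and the content match allows at most two middle tokens between the first and last name. The example weights are arbitrary and are not the values used in the experiments.

```python
import re
from dataclasses import dataclass


@dataclass
class Email:  # hypothetical container for a parsed message
    sender: str
    to: list
    cc: list
    body: str


@dataclass
class Candidate:  # hypothetical container for a candidate expert
    first_name: str
    last_name: str
    addresses: set  # lowercased e-mail addresses


def email_from(msg, ca):  # A0
    return int(msg.sender.lower() in ca.addresses)


def email_to(msg, ca):  # A1
    return int(any(addr.lower() in ca.addresses for addr in msg.to))


def email_cc(msg, ca):  # A2
    return int(any(addr.lower() in ca.addresses for addr in msg.cc))


def email_content(msg, ca):  # A3
    # First and last names are obligatory; up to two middle tokens
    # (an assumed limit) may appear between them.
    pattern = (
        r"\b" + re.escape(ca.first_name)
        + r"\s+(?:[\w.]+\s+){0,2}"
        + re.escape(ca.last_name) + r"\b"
    )
    return int(re.search(pattern, msg.body, re.IGNORECASE) is not None)


ASSOCIATIONS = (email_from, email_to, email_cc, email_content)


def association(msg, ca, weights):
    """a(d, ca) = sum_i pi_i * A_i(d, ca) for the weights pi_0..pi_3."""
    return sum(w * method(msg, ca) for w, method in zip(weights, ASSOCIATIONS))


if __name__ == "__main__":
    msg = Email(
        sender="alice@example.org",
        to=["bob@example.org"],
        cc=["carol@example.org"],
        body="As Carol J. Smith pointed out, the draft needs another review.",
    )
    carol = Candidate("Carol", "Smith", {"carol@example.org"})
    print(association(msg, carol, weights=(1.0, 0.5, 2.0, 0.1)))
```

Setting a single weight to 1 and the others to 0 recovers the individual methods A0-A3.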
2.2 Experimental Evaluation
We carried out experiments to answer the following question: how effective is our modeling approach for finding experts? The document collection we use is part of the W3C corpus [4], which was used at the 2005 TREC Enterprise track [3] and comes with a list of candidate experts, expert finding topics, and relevance assessments for these topics. For the purposes of our experiments, we restrict ourselves to the e-mail lists in the corpus, omitting other types of documents from the W3C corpus and candidate expert names that do not occur in the e-mail lists.
We conducted two sets of experiments: comparing the impact of the association methods on expert finding effectiveness, and examining the impact of combinations of these association methods. Table 1 contains the expert finding results for different association methods. The most effective association method is A0 (EMAIL_FROM), on all measures.
Assuming that different associations perform in complementary ways, we explored linear combinations of association methods; Table 2 reports a sample of results. Briefly, the main findings are that (i) using the EMAIL_CONTENT method increases the number of retrieved candidates, but hurts the other measures; (ii) putting extra weight on a single header field helps, but only on a subset of the measures; and (iii) our best found combination (bottom row) improves on all measures. Surprisingly, the cc field is very important when it is used within a combination; the person being cc'd appears to be an authority on the content of the message.
3 Mining Contact Details
Once an expert has been determined, retrieving his/her contact details is a natural next component of an operational expert finder. We show that contact details can be effectively mined from e-mail signatures.
3.1 Extracting Signatures
One of the challenges of expert profiling is to maintain a database with the candidates' details. To address this issue, we mine e-mail signatures. Many signatures (but by no means all) contain reliable details about a person's affiliation and contact information.
Before mining signatures, we need to identify them. Our heuristics are precision-oriented, yet they still find a large number of signatures with valuable personal data: (i) a signature is placed at the end of the e-mail and separated from the message body with "-"; (ii) it is between 3 and 10 lines long; (iii) it contains at least one web address or tel/fax number; and (iv) signatures containing stop words (P.S., antivirus, disclaimer, etc.) or PGP keys are ignored.
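The four heuristics can be sketched in Python as follows. This is an illustrative sketch, not the extraction code used for the experiments. Several details are assumptions: the delimiter is taken to be a line consisting only of dashes, blank lines are not counted toward the signature length, and the URL and phone patterns are deliberately simple. The stop-word list contains only the examples named in the text.

```python
import re

DELIMITER = re.compile(r"^-+\s*$")  # assumed: a line made only of dashes
URL = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
PHONE = re.compile(r"\b(tel|phone|fax)\b[^\n]*\d", re.IGNORECASE)
STOP_WORDS = ("p.s.", "antivirus", "disclaimer")  # examples from the text
PGP_MARKER = "-----begin pgp"


def extract_signature(message, min_lines=3, max_lines=10):
    """Return the signature lines of a message, or None if no heuristic match."""
    lines = message.rstrip().splitlines()

    # (i) the signature follows the last delimiter line of the message
    for i in range(len(lines) - 1, -1, -1):
        if DELIMITER.match(lines[i]):
            break
    else:
        return None
    signature = [line.strip() for line in lines[i + 1:] if line.strip()]

    # (ii) length between min_lines and max_lines (blank lines not counted)
    if not min_lines <= len(signature) <= max_lines:
        return None

    text = "\n".join(signature)
    lowered = text.lower()

    # (iv) ignore signatures with stop words or PGP keys
    if PGP_MARKER in lowered or any(word in lowered for word in STOP_WORDS):
        return None

    # (iii) require at least one web address or tel/fax number
    if not (URL.search(text) or PHONE.search(text)):
        return None
    return signature


if __name__ == "__main__":
    example = (
        "Hi all,\n\nSee the attached draft.\n\n--\n"
        "Jane Doe\nExample Corp\nTel: +31 20 555 0100\n"
        "http://www.example.org/~jane\n"
    )
    print(extract_signature(example))
```

Because every check must pass before a signature is returned, the sketch favours precision over recall, in line with the stated aim of the heuristics.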
3.2 Statistics on Experts' Details
How effective are our unsupervised methods for extracting personal information? Table 3 details the results of our signature mining experiments. ALL refers to all people found within the corpus, while W3C refers to those people who were on the list of candidate experts provided by TREC. We restricted our identification method to people who appear more than 5 times in e-mail headers.
| | ALL | W3C |
| personal data found in signatures | 1,492 | 246 |
4 Conclusions and Further Work
We have presented methods for expertise identification using e-mail communications. Our expert modeling approach uses language modeling techniques and combines multiple pieces of evidence of expertise. This method is very effective in terms of the number of relevant experts found. Possible further improvements include identifying more expertise indicators and using the thread structure of the e-mail lists. Our extraction method finds contact information for candidates using e-mail signatures. In future work we plan to extract additional details, such as affiliation and address information.
References
[1] C. S. Campbell, P. P. Maglio, A. Cozzi, and B. Dom. Expertise identification using email communications. In CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management, pages 528-531. ACM Press, 2003.
[2] K. Mock. An experimental framework for email categorization and management. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 392-393, New York, NY, USA, 2001. ACM Press.
[3] Enterprise track, 2005. URL: http://www.ins.cwi.nl/projects/trec-ent/wiki/.
[4] W3C. The W3C test collection, 2005. URL: http://research.microsoft.com/users/nickcr/w3c-summary.html.
[5] S. Whittaker and C. Sidner. Email overload: exploring personal information management of email. In CHI '96: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 276-283, New York, NY, USA, 1996. ACM Press.