
Finding topics in dynamical text: application to chat line discussions

Ella Bingham
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
ella.bingham@hut.fi
Ata Kabán
University of Paisley
Paisley, Scotland, UK
kaba-ci0@paisley.ac.uk
Mark Girolami
University of Paisley
Paisley, Scotland, UK
giro-ci0@paisley.ac.uk

ABSTRACT

The problem of analyzing dynamically evolving textual data has recently arisen. An example of such data is the discussion appearing in Internet chat lines. In this paper a recently introduced method, termed complexity pursuit, is used to extract the topics of a dynamical chat line discussion. Experimental results demonstrate that meaningful topics can be found, and also suggest the applicability of the method to query-based retrieval from a temporally changing text stream.

Keywords

chat line discussion, dynamical text, topic identification, complexity pursuit, data mining

1. INTRODUCTION

In times of huge information flow on the Internet, there is a strong need for automatic textual data analysis tools. Algorithms developed for text mining from static text collections have been presented e.g. in [1,2,7]. Our emphasis is on the recently emerged problem of analyzing dynamically evolving textual data; investigating appropriate tools for this task is of practical importance. An example of such data is found in Internet relay chat rooms: the topic of interest changes as participants contribute. The online text stream can thus be seen as a time series, and methods of time series processing may be used to extract the topics of the discussion.

We present results of applying a recently introduced, powerful method, complexity pursuit [3], to topic extraction in a dynamically evolving discussion. Complexity pursuit uses both information-theoretic measures and the time correlations of the data, which makes it more powerful than methods that use only one of these; examples of the latter include [5,6,8,9].

2. CHAT LINE DATA

The discussion found in chat lines on the Internet is an ongoing stream of text generated by the chat participants and the chat line moderator. To analyze it with data mining methods, a convenient technique is to split the stream into windows, which may overlap if desired. Each such window can then be viewed as one document. We represent the documents using the vector space model [11]: each document forms one T-dimensional vector, where T is the number of distinct terms in the vocabulary. The i-th element of the vector indicates (some function of) the frequency of the i-th vocabulary term in the document. The term-by-document matrix $\mathbf X$ contains the document vectors as its columns and is of size $T \times N$, where N is the number of documents.
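As a concrete illustration, the following is a minimal sketch of this windowing and vectorization step in Python; the window sizes, the tokenization, and the function names are our own assumptions rather than part of the method:

import re
from collections import Counter

import numpy as np

def stream_to_windows(text, width=750, overlap=250):
    """Split the chat stream into overlapping character windows (sizes illustrative)."""
    step = width - overlap
    return [text[i:i + width] for i in range(0, max(len(text) - width, 1), step)]

def term_by_document(windows):
    """Build the T x N term-by-document matrix X; entry (i, j) is the raw
    frequency of vocabulary term i in document (window) j."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    vocab = sorted({term for win in windows for term in tokenize(win)})
    index = {term: i for i, term in enumerate(vocab)}
    X = np.zeros((len(vocab), len(windows)))
    for j, win in enumerate(windows):
        for term, count in Counter(tokenize(win)).items():
            X[index[term], j] = count  # a raw count; tf-idf weighting is also common
    return X, vocab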

As a preprocessing step we perform the LSI [2] of the data matrix $\mathbf X$ by computing its singular value decomposition, thus acquiring a lower-dimensional projection of the high-dimensional data. We denote the new data matrix by Z = (z(t)). The topics of the discussion can be found by projecting Z onto the directions $\mathbf W = (\mathbf w_1 \cdots \mathbf w_M)$ given by the algorithm described in the following Section. M, the number of estimated minimum-complexity projections, may be smaller than K, the dimension of the LSI projection.
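A minimal sketch of this LSI step, here using a dense SVD from NumPy (for a matrix as large and sparse as ours, a truncated sparse SVD such as scipy.sparse.linalg.svds would be preferable):

import numpy as np

def lsi_project(X, K=50):
    """Project the T x N matrix X onto its K leading singular directions (LSI)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Z = S[:K, None] * Vt[:K]  # K x N; column t is the LSI representation z(t)
    return Z, U[:, :K]        # also return the K term-space basis vectors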

3. THE ALGORITHM

Complexity pursuit [3] is a recently developed, computationally simple algorithm for separating "interesting" components from high-dimensional time series. These components -- here, the topics of the discussion -- are found by projecting the data onto the directions given by the algorithm. In complexity pursuit, an "interesting" component is one whose distribution has a low coding complexity, measured by an approximate Kolmogoroff complexity. We model the topics of the discussion as probability distributions over terms, and the distributions having minimum complexity are assumed to best represent the distinct topics.

We assume that the preprocessed K-dimensional observations $\mathbf z(t)$ are linear mixtures of some latent, time-correlated topics $s$. Both the latent topics and the mixing process are unknown. The topic time series are found by projecting the observation time series onto directions $\mathbf w$ as $s(t) = \mathbf w^T \mathbf z(t)$, where the projection directions $\mathbf w$ are to be found. A separate autoregressive (AR) model, here $\hat{s}(t) = \alpha s(t-1)$, is assumed to model each topic time series $s$. At every step of the algorithm, the AR constant $\alpha$ is first estimated. Then the gradient update of $\mathbf w$ that minimizes the approximate Kolmogoroff complexity [3] of the AR residuals $s(t)-\hat{s}(t) = \mathbf w^T(\mathbf z(t) - \alpha \mathbf z(t-1))$ is the following:
\begin{align}
\mathbf w \leftarrow & \; \mathbf w - \mu\, E\{ (\mathbf z(t) - \alpha \mathbf z(t-1))\, g(\mathbf w^T(\mathbf z(t) - \alpha \mathbf z(t-1))) \} \\
\mathbf w \leftarrow & \; \mathbf w / \Vert\mathbf w\Vert
\end{align}
where $\mu$ is a learning rate and $g$ is the scalar nonlinearity arising from the approximation of the Kolmogoroff complexity [3].
To estimate several projection directions we can proceed in a deflationary manner, making each new projection direction orthogonal to the previously found ones, as in Gram-Schmidt orthogonalization: having estimated $p$ projections, we run the algorithm for $\mathbf w_{p+1}$ and, after every iteration step, subtract from $\mathbf w_{p+1}$ its projections onto the previously estimated $p$ vectors and then renormalize $\mathbf w_{p+1}$. Another possibility is to estimate all projections simultaneously in a symmetric manner and use the orthogonalizing decorrelation $\mathbf W \leftarrow (\mathbf W \mathbf W^T)^{-1/2} \mathbf W$ instead of the normalization (2) above. Details of the approximation of the Kolmogoroff complexity and its minimization can be found in [3].
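To make the procedure concrete, here is a minimal sketch of the deflationary variant, under two simplifying assumptions of ours: a fixed tanh stands in for the nonlinearity $g$ derived from the Kolmogoroff complexity approximation in [3], and $\alpha$ is re-estimated by least squares at each iteration:

import numpy as np

def complexity_pursuit(Z, M, mu=0.1, n_iter=200, seed=0):
    """Estimate M minimum-complexity projection directions from the K x N
    LSI-projected data Z (columns are the observations z(t)). Sketch only:
    g = tanh replaces the complexity-derived nonlinearity of [3]."""
    rng = np.random.default_rng(seed)
    K, N = Z.shape
    W = np.zeros((K, M))
    for p in range(M):
        w = rng.standard_normal(K)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            s = w @ Z                                     # s(t) = w^T z(t)
            alpha = (s[1:] @ s[:-1]) / (s[:-1] @ s[:-1])  # least-squares AR(1) constant
            D = Z[:, 1:] - alpha * Z[:, :-1]              # z(t) - alpha z(t-1)
            r = w @ D                                     # AR residuals s(t) - s_hat(t)
            w = w - mu * (D @ np.tanh(r)) / (N - 1)       # gradient step, cf. (1)
            w -= W[:, :p] @ (W[:, :p].T @ w)              # deflation against earlier w's
            w /= np.linalg.norm(w)                        # renormalization, cf. (2)
        W[:, p] = w
    return W

The rows of W.T @ Z are then the estimated topic time series $s(t)$.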

The computational load of the algorithm can be approximated as follows. The LSI preprocessing of a sparse $T \times N$ data matrix is of order $O(NTc)$, where $c$ is the number of nonzero entries in each column of the data matrix. (In our experiments the data matrix is very sparse and $c$ is about 30 to 40, as each document contains about 30 to 40 distinct vocabulary terms.) The algorithm itself is of order $O(NK^2M)$, where $K \ll T, N$ is the dimensionality into which the data was projected using LSI, and $M$ is the number of topics estimated. Thus the algorithm scales linearly in the number of observations.

4. EXPERIMENTS ON CHAT LINE DATA

The chat line data was collected from the CNN Newsroom chat line http://www.cnn.com/chat/channel/cnn_newsroom. A contiguous stream of almost 24 hours of discussion among 3,200 chat participants, contributing 25,000 comment lines, was recorded on January 18th, 2001. The data was cleaned by omitting all user names and non-user-generated text. The remaining text stream was split into overlapping windows of about 750 characters. From these windows term histograms were generated using the Bow toolkit http://www.cs.cmu.edu/~mccallum/bow/, resulting in a term-by-document matrix $\mathbf X$ of size $T \times N = 4743 \times 7430$. The LSI of order K=50 was computed as a preprocessing step. The choice of the number of estimated topics M is relatively flexible, and in this abstract we present results for M=10, 7 and 4.
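Putting the sketches above together, an end-to-end run with the parameters reported here would look roughly as follows (raw_chat_text is our placeholder name for the cleaned text stream):

windows = stream_to_windows(raw_chat_text, width=750)  # overlapping ~750-char windows
X, vocab = term_by_document(windows)                   # T x N term-by-document matrix
Z, U = lsi_project(X, K=50)                            # LSI preprocessing of order 50
W = complexity_pursuit(Z, M=10)                        # 10 minimum-complexity directions
S = W.T @ Z                                            # rows: topic time series s(t), cf. Figure 1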

Figure 1 shows the topic time series s(t) when 10 topics are assumed. We can see that the topics are autocorrelated in time; different topics are activated at different times. Some topics show short peaks of contribution whereas others are active for longer periods.

Figure 1: Topic time series (horizontal axis: chat windows). 10 topics are assumed.