# Finding topics in dynamical text: application to chat line discussions

**Ella Bingham**
Neural Networks Research Centre, Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
ella.bingham@hut.fi

**Ata Kabán**
Neural Networks Research Centre, Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
kaba-ci0@paisley.ac.uk

**Mark Girolami**
Neural Networks Research Centre, Helsinki University of Technology
P.O. Box 5400, FIN-02015 HUT, Finland
giro-ci0@paisley.ac.uk

### ABSTRACT

The problem of analysing dynamically
evolving textual data has recently arisen. An example of such
data is the discussion appearing in Internet chat lines.
In this paper a recently introduced
method, termed *complexity pursuit*, is used to
extract the topics of a dynamical chat line discussion.
Experimental results demonstrate that meaningful topics can be found
and also suggest the applicability of the method to query-based retrieval
from a temporally changing text stream.

### Keywords

chat line discussion, dynamical text, topic identification, complexity pursuit, data mining

### 1. INTRODUCTION

Given the huge flow of information on the Internet,
there is a strong need for automatic tools for textual data analysis.
Algorithms
developed for text mining from *static* text collections have been
presented e.g. in [1,2,7].
Our emphasis is on the recently emerged problem of
analyzing *dynamically evolving* textual data;
investigating appropriate tools for this task is of practical importance.
An example of such data is found in
Internet relay chat rooms: the topic of interest changes
as the participants contribute. The online text stream can
thus be viewed as a time series, and time series processing
methods may be used to extract the topics
of the discussion.

We present results of applying a recently introduced powerful method,
*complexity pursuit* [3],
to topic extraction in a dynamically evolving discussion.
Complexity pursuit uses both information-theoretic measures
and time-correlations of the data, which makes it more powerful than methods
that use only one of these; examples of the latter include
[5,6,8,9].

### 2. CHAT LINE DATA

The discussion found in chat lines on the Internet is an ongoing
stream of text generated
by the chat participants and the chat line moderator. To analyze it with
data mining methods, a convenient technique is to split the stream into
windows, which may overlap if desired. Each such window can then be
viewed as one document. We represent the documents using
the vector space model [11]: each document forms one
*T*-dimensional vector where *T* is the number of distinct terms in
the vocabulary. The *i*-th element of the vector indicates (some
function of) the
frequency of the *i*-th vocabulary term in the document.
The term by document matrix **X** contains the
document vectors as its columns and is of size *T* times *N*
where *N* is the number of documents.
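A minimal sketch of this windowing and vector-space representation might look as follows. The tokenizer, function name, and window/step sizes used in the toy call are our illustrative assumptions, not the paper's exact setup:

```python
import re
from collections import Counter

def term_document_matrix(stream, window=750, step=375):
    """Split a text stream into overlapping character windows and build a
    term-by-document count matrix X (vocabulary terms x windows).
    Tokenizer and default sizes are illustrative assumptions."""
    # Each overlapping window of the stream is treated as one document.
    docs = [stream[i:i + window]
            for i in range(0, max(1, len(stream) - window + 1), step)]
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    # X[i][j] = frequency of vocabulary term i in document j.
    X = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            X[index[term]][j] = count
    return vocab, X

# Toy stream aligned with the window size, so no words are cut at boundaries.
stream = "the topic changes as participants contribute " * 10
vocab, X = term_document_matrix(stream, window=90, step=45)
```

In practice the matrix would be stored in sparse form, since each document uses only a small fraction of the vocabulary.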

As a preprocessing step,
we perform latent semantic indexing (LSI) [2] of the data matrix **X**
by computing its singular value decomposition,
thus obtaining a lower-dimensional projection of the high-dimensional
data.
We denote the new data matrix as
**Z** = (**z**(*t*)).
The topics of the discussion
can be found by projecting **Z** to the directions
given by the algorithm described in the following Section.
*M*, the number of estimated minimum complexity projections,
may be smaller than *K*, the dimension of the LSI projection.
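A rough sketch of this preprocessing step, assuming a dense NumPy matrix for simplicity (the function name and toy data are ours, not the paper's):

```python
import numpy as np

def lsi_project(X, K):
    """LSI preprocessing: project the term-by-document matrix X (T x N)
    onto its top-K singular directions via the SVD, X ~ U_K S_K V_K^T."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Z is K x N: column z(t) is the low-dimensional representation
    # of the document at time t.
    Z = np.diag(S[:K]) @ Vt[:K]
    return U[:, :K], Z

# Toy sparse-ish term counts standing in for a real term-document matrix.
rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(200, 40)).astype(float)
U_K, Z = lsi_project(X, K=10)
```

For the matrix sizes reported in Section 4, a sparse SVD routine would be used instead of the dense `numpy.linalg.svd` shown here.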

### 3. THE ALGORITHM

Complexity pursuit [3] is a recently developed, computationally simple algorithm for separating "interesting" components from high-dimensional time series. These components -- here, the topics of the discussion -- are found by projecting the data in the directions given by the algorithm. In complexity pursuit, an "interesting" component is one whose distribution has a low coding complexity, measured by an approximate Kolmogoroff complexity. We model the topics of the discussion as probability distributions on terms, and the distributions having minimum complexity are assumed to best represent the distinct topics.

We assume that the preprocessed *K*-dimensional observations
**z**(*t*)
are linear mixtures of some latent, time-correlated topics *s*. Both
the latent topics and the mixing process are unknown.
The topic time series are found by projecting the observation time series
into directions **w** as
*s*(*t*) = **w**^{T} **z**(*t*)
where the projection directions **w** are to be found.
A separate first-order autoregressive (AR) model, here

*s*(*t*) = *α* *s*(*t*-1) + *e*(*t*),

is assumed to model each topic time series *s*.
At every step of the algorithm, the AR constant *α*
is first estimated.
Then the gradient update of **w** that minimizes the
approximate Kolmogoroff complexity [3] of the AR residuals

*e*(*t*) = **w**^{T}(**z**(*t*) - *α* **z**(*t*-1))

is the following:

**w** := **w** + *μ* E{(**z**(*t*) - *α* **z**(*t*-1)) *g*(*e*(*t*))}   (1)

**w** := **w** / ||**w**||   (2)

where *μ* is a step size and *g* is a nonlinearity arising from the approximation of the Kolmogoroff complexity [3].

To estimate several projection directions we can proceed in a deflationary
manner, making the new projection direction orthogonal to the previously
found ones, similarly to Gram-Schmidt orthogonalization. After the
estimation of *p* projections, we run the algorithm
for **w**_{p+1} and after every iteration step subtract from
**w**_{p+1} the projections of the previously estimated
*p* vectors, and then renormalize **w**_{p+1}.
Another possibility is
to estimate all projections simultaneously in a symmetric manner and use
orthogonal decorrelation
**W** := (**WW**^{T})^{-1/2} **W** instead of
the normalization (2) in the above set of equations.
Details of the approximation of the Kolmogoroff complexity and its
minimization can be found in [3].
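As a concrete illustration, the single-unit update with deflationary orthogonalization might be sketched as follows. This is a sketch under stated assumptions: the contrast derivative *g* is taken as tanh, standing in for the density-based nonlinearity derived in [3]; the data is assumed whitened; and all names are ours:

```python
import numpy as np

def complexity_pursuit(Z, M, iters=200, seed=0):
    """Deflationary estimation of M minimum-complexity projections from
    the K x N data matrix Z, following the scheme of Section 3.
    Assumptions: Z is whitened, and g = tanh stands in for the
    density-based nonlinearity derived in [3]."""
    rng = np.random.default_rng(seed)
    K, N = Z.shape
    W = np.zeros((M, K))
    for p in range(M):
        w = rng.standard_normal(K)
        w /= np.linalg.norm(w)
        for _ in range(iters):
            s = w @ Z                               # current topic series s(t)
            alpha = (s[1:] @ s[:-1]) / (s @ s)      # AR(1) coefficient estimate
            Zres = Z[:, 1:] - alpha * Z[:, :-1]     # innovations of the mixtures
            eps = w @ Zres                          # AR residuals e(t)
            w = (Zres * np.tanh(eps)).mean(axis=1)  # gradient-type update
            w -= W[:p].T @ (W[:p] @ w)              # Gram-Schmidt deflation
            w /= np.linalg.norm(w)                  # renormalize
        W[p] = w
    return W

# Toy data: three AR(1) sources with Laplacian innovations, linearly mixed.
rng = np.random.default_rng(1)
S = np.zeros((3, 1000))
innov = rng.laplace(size=(3, 1000))
for t in range(1, 1000):
    S[:, t] = 0.8 * S[:, t - 1] + innov[:, t]
Z = rng.standard_normal((3, 3)) @ S
W = complexity_pursuit(Z, M=2)
```

By construction, the rows of the returned **W** are orthonormal; the symmetric variant would instead update all rows at once and decorrelate them with **W** := (**WW**^{T})^{-1/2} **W**.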

The computational load of the algorithm can be approximated as follows.
The LSI preprocessing of a sparse *T* times *N* data matrix is
of order *O(NTc)* where *c* is the number of nonzero entries in each
column of the data matrix. (In our experiments, the data matrix is very
sparse and *c* is about 30 to 40,
as each document contains about 30 to 40 distinct vocabulary terms.)
The algorithm itself is of order *O(NK^{2}M)*
where *K* is the dimensionality (*K* << *T*, *N*) into which the data was
projected using LSI, and *M* is the number of topics estimated.
Thus the algorithm scales linearly in the number of observations.
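Plugging the figures from the experiments of Section 4 into these estimates gives a feel for the relative costs (taking *c* = 35 as a mid-range value for "about 30 to 40"; constant factors are ignored):

```python
# Rough operation counts for the two stages, using the Section 4 figures
# (c = 35 is an assumed mid-range value for the nonzeros per column).
T, N = 4743, 7430              # vocabulary size, number of documents
c, K, M = 35, 50, 10           # nonzeros per column, LSI dimension, topics

lsi_cost = N * T * c           # O(NTc): sparse LSI preprocessing
pursuit_cost = N * K ** 2 * M  # O(NK^2 M): the pursuit iterations

# Both counts grow linearly in the number of documents N.
ratio = lsi_cost / pursuit_cost
```

Under these assumptions the LSI preprocessing dominates the pursuit iterations by roughly a factor of six to seven.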

### 4. EXPERIMENTS ON CHAT LINE DATA

The chat line data was collected from the CNN Newsroom chat line
http://www.cnn.com/chat/channel/cnn_newsroom.
A contiguous stream of almost 24 hours of discussion among 3200 chat participants,
contributing 25 000 comment lines,
was recorded on January 18th, 2001.
The data was cleaned by omitting all user names and non-user generated text.
The remaining text stream was split into overlapping
windows of about 750 characters.
From these windows a term histogram was generated using the Bow toolkit
http://www.cs.cmu.edu/~mccallum/bow/,
resulting in a term by document matrix **X** of size
*T* times *N*, that is, 4743 times 7430.
The LSI of order *K*=50 was computed as a preprocessing step.
The choice of the number of estimated topics *M* is relatively flexible,
and in this abstract we present results for *M* = 10, 7 and 4.

Figure 1 shows the topic time series *s(t)*
when 10 topics are assumed.
We can see that the topics are autocorrelated in time;
different topics are activated at different times.
Some topics show short peaks of contribution whereas
others are active for longer periods.