Discovering the Gap Between Web Site Designers' Expectations and Users' Behavior

Takehiro Nakayama, Hiroki Kato, Yohei Yamane
Fuji Xerox Corporate Research Laboratories
430 Sakai, Nakai-machi, Ashigarakami-gun, Kanagawa 259-0157, Japan
{takehiro.nakayama,  hiroki.kato,  yohei.yamane}@fujixerox.co.jp

Abstract

This paper proposes a technique that discovers the gap between Web site designers' expectations and users' behavior. The former are assessed by measuring inter-page conceptual relevance, the latter by measuring inter-page access co-occurrence. Pages that are conceptually related but rarely co-occur in visits indicate areas where Web site design improvement would be appropriate. Further, the technique gives quantitative redesign suggestions based on a multiple regression analysis that predicts hyperlink traversal frequency from page layout features. The effectiveness of the technique is validated by case studies.

Keywords: Web site design; conceptual relevance; access co-occurrence; regression analysis


1. Introduction

The World Wide Web has grown explosively since its creation. Having gained the attention of millions of people across social and geographical boundaries, the Web is now frequently used by companies attempting to succeed in the global market. Many companies maintain their own Web site, and emphasize its importance for their business.

Business Web sites generally cover a wide range of topics to provide information for users with different interests and goals. Hypertext on the Web is a convenient means of integrating multiple topics because it allows users to navigate through multiple paths according to their own preferences. Web site designers carefully construct a site structure with the aim of facilitating effective user navigation. However, designing a good Web site is not a simple task because hypertext structures can easily expand in a chaotic manner as the number of pages increases. Thus, many techniques to improve the effectiveness of user navigation have been proposed. They can be broadly classified into two types: client-side assistance and server-side assistance. The former helps users to browse the hypertext structure effectively. For example, Botafogo [3] reconstructs the hypertext structure by employing clustering algorithms based on graph theory, Lieberman [8] tracks users' browsing behavior to anticipate items of interest by exploring hyperlinks from a user's current position, and Li et al. [7] parse metadata from bookmarked pages for personalized indexing and classification. The latter type helps Web site designers to construct an effective site structure. In this paper, we study techniques of this type. Before describing our approach, we first review related work.

1.1. Related Work

Extracting user navigation patterns is an essential task in assisting Web site designers to understand users' preferences for further redesign activities. When users interact with a Web site, data recording their accesses is stored in Web server logs. The access log data includes the client's IP address, the URL of the requested page, and the time the request was received. By identifying individual users' sessions from access log data, it is possible to infer user navigation patterns. For example, Chen et al. [5] convert access log data into a set of consecutive references to find frequent navigation patterns. Borges and Levene [2] model access log data as a directed graph where nodes correspond to pages and arcs to hyperlinks. The graph is weighted, and the weight of an arc represents a probability that reflects user interaction with the Web site. Further, association rules developed in the data mining field [1] are modified to extract navigation patterns from the graph.

These techniques basically attempt to find access paths that a number of users have followed. However, user session identification from access log data is not always accurate because accesses from search engine robots and proxy servers are also recorded in the log data. The former exhaustively traverse hyperlinks regardless of page contents to collect as many pages as possible, and the latter represent the activities of multiple users. Even if accesses from real users are distinguished by heuristics, session identification remains inaccurate because of client-side caching, i.e., backward references are not recorded in the access log data. Moreover, frequent navigation patterns only indicate how the Web site, as currently designed, is being used. Thus, Web site designers still need to interpret these patterns to improve the site quality.

To assist the interpretation of frequent navigation patterns, features that characterize these patterns must be employed. User classification is one approach, where frequent navigation patterns are interpreted differently based on the importance of users. For example, Spiliopoulou et al. [12] compare the navigation patterns of customers (Web site users who have purchased something) with those of non-customers. This comparison leads to rules on how the site's topology should be improved to turn non-customers into customers. Another approach uses hyperlink connectivity. For example, Perkowitz and Etzioni [10] find clusters of pages that tend to co-occur in visits but are not connected. For each cluster, an index page consisting of hyperlinks to the pages in the cluster is generated, enabling more effective traversal between these pages.

One drawback of these approaches is that they lack techniques to evaluate infrequent navigation patterns. In other words, they extract no navigation patterns that should be frequent (in the ideal Web site) but are actually infrequent because of poor site design. To find navigation patterns that should be frequent, we presume that page content analysis, as well as access log analysis, will be important.

Another drawback is that these approaches mostly concentrate on the hypertext topology and suggest no clues to improving the site design at the page layout level. In other words, if pages suggested for further improvement are already connected, Web site designers have to work on them without any help. Because the quality of page layout is dependent on many factors (e.g., topics, objectives, users, size, languages, the use of multimedia techniques, visual/logical consistency with other pages, and so forth), analysis of site-specific page layout features will be necessary.

1.2. Our Approach

We assume that Web site designers expect conceptually related pages to co-occur in visits if their site is well designed. Thus, finding that some conceptually related pages rarely co-occur suggests a set of pages whose design should be improved. By employing the vector space model [11], which translates conceptual relevance into a numerical value based on word frequency, we first measure the inter-page conceptual relevance for each pair of pages in the Web site. Second, by modifying the vector space model, we measure the inter-page access co-occurrence. Third, we compare conceptual relevance with access co-occurrence for each pair. This process can be viewed as discovering the gap between Web site designers' expectations and users' behavior.

Given page pairs with a significant gap, Web site designers can focus their redesign activities effectively. However, they still need to know how they can improve the site design because the cause of the gap is not known. If page pairs are not connected by a hyperlink, making a connection or generating an index page that references them as proposed by Perkowitz and Etzioni [10] is an intuitive solution. However, if page pairs are already connected, Web site designers probably do not know how they should work on each page. Therefore, we analyze the correlation between hyperlink traversal frequency and page layout features to give quantitative suggestions on redesigning the page layout. Further, by using the statistical data obtained above, Web site designers can simulate the effect of redesigning without involving users.

The rest of the paper is organized as follows. Section 2 describes the technique that discovers the gap between Web site designers' expectations and users' behavior. Section 3 studies how the site should be improved, and describes a system that gives quantitative suggestions about page layout design for better user navigation. Section 4 discusses the validation of our approach by means of case studies. Section 5 concludes the paper and gives suggestions for future work.

In this paper, we use Fuji Xerox's public Web site (http://www.fujixerox.co.jp, as of October 1999) as a source of experimental data. It consists of 2,825 textual pages, most of which are written in Japanese. The topics include corporate philosophy, recruiting, products, research activities, office locations, and so forth. We use access log data consisting of 235,413 accesses, after excluding accesses from robots and proxies. We also use JMP (SAS Institute Inc.) for statistical computation.


2. Discovering the Gap Between Web Site Designers' Expectations and Users' Behavior

In this section, we introduce a technique that discovers the gap between Web site designers' expectations and users' behavior. In our approach, the former are assessed by measuring the inter-page conceptual relevance, whereas the latter are assessed by measuring the inter-page access co-occurrence, as described above.

2.1. Measurement of Conceptual Relevance

We employ the vector space model to measure the inter-page conceptual relevance. Given a Web site, we first remove HTML tags from each page. Second, we obtain content words (nouns, verbs, and adjectives) by performing morphological analysis and stop-word removal. Third, we compute the frequency of content words for each page. Fourth, we generate a list of content words weighted with their frequency. This list is viewed as a vector that represents page contents.


\[
\vec{p}_i = (w_{i1}, w_{i2}, \ldots, w_{in})
\]

where $w_{ik}$ is the weight of the kth content word, and n is the number of distinct content words found in the Web site.

We finally measure the inter-page conceptual relevance (SimC) for each page pair pi and pj using the cosine similarity formula as follows.


\[
SimC(p_i, p_j) = \frac{\sum_{k=1}^{n} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2}\; \sqrt{\sum_{k=1}^{n} w_{jk}^2}}
\]

where SimC is 0 if one of the pages contains no content words.

If the number of content words that appear in both pages is 0, the value of SimC is also 0. If two pages contain identical content words with the same frequency (i.e., vectors of two pages are identical), the value of SimC is 1. Note that all pages are equally informative in the vector space model because of the page length normalization.
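As a minimal sketch of this measurement, assuming plain-text input: the toy tokenizer below stands in for the morphological analysis and stop-word removal (the Fuji Xerox site is largely Japanese, so a real implementation would need a morphological analyzer), and all function names are ours.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "or", "to"}

def content_words(text):
    """Toy stand-in for morphological analysis plus stop-word removal."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity of two sparse frequency vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sim_c(page_i, page_j):
    """Inter-page conceptual relevance (SimC) of two plain-text pages."""
    return cosine(Counter(content_words(page_i)),
                  Counter(content_words(page_j)))

print(sim_c("color laser printer speed", "the laser printer toner"))
```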

2.2. Measurement of Access Co-occurrence

We modify the vector space model to measure the inter-page access co-occurrence. Given access log data, we first remove accesses from search engine robots and proxy servers using heuristics (e.g., access to /robots.txt; exhaustive access in a short period). Second, we count, for each page, the number of accesses from each IP address. Third, we generate a list of IP addresses weighted with their access frequency. This list is viewed as a vector that represents page users. We finally measure the inter-page access co-occurrence (SimA) for each page pair using the aforementioned cosine similarity formula.


\[
SimA(p_i, p_j) = \frac{\sum_{k=1}^{t} a_{ik}\, a_{jk}}{\sqrt{\sum_{k=1}^{t} a_{ik}^2}\; \sqrt{\sum_{k=1}^{t} a_{jk}^2}}
\]

where $a_{ik}$ is the weight of the kth IP address that visited $p_i$, and t is the number of distinct IP addresses found in the access log data. SimA is 0 if one of the pages has never been visited by anyone.

If the number of users who visit both pages is 0, the value of SimA is also 0. If two pages are visited by identical users with the same frequency, the value of SimA is 1. Note that the inter-page access co-occurrence in this model is independent of the page popularity because the number of visits is normalized. It is also independent of hyperlink connectivity.
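A matching sketch for SimA, assuming the log has already been filtered and reduced to (IP address, URL) records; the record layout and names are our assumptions.

```python
import math
from collections import Counter, defaultdict

def access_vectors(log):
    """One IP-frequency vector per page, built from (ip, url) records."""
    vectors = defaultdict(Counter)
    for ip, url in log:
        vectors[url][ip] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse frequency vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

log = [("10.0.0.1", "/a"), ("10.0.0.1", "/b"),
       ("10.0.0.2", "/a"), ("10.0.0.2", "/a")]
vectors = access_vectors(log)
print(cosine(vectors["/a"], vectors["/b"]))  # SimA of /a and /b
```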

2.3. Gap Discovery

Before comparing the inter-page conceptual relevance with the access co-occurrence for each pair of pages, we introduce the notions of content page and index page. While the former is an ordinary page that conveys conceptual contents to users, the latter is a functional page for navigational help. In general, an index page has multiple references to content pages on various topics, and tends to include content words of various topics without conceptual consistency. Consequently, measuring the conceptual relevance between a content page and an index page (or between two index pages) generates noisy, meaningless values. We therefore discard index pages in advance. The question is how to distinguish index pages. Because many pages actually have characteristics of both a content page and an index page, the boundary between them can be subjective. In this paper, we use the number of references in a page as a guide, based on the intuitive idea that an index page should have more references than a content page: we consider a page with more than N references to be an index page. To determine the optimal value of N, we compute the correlation coefficient (RCA) between the inter-page conceptual relevance (SimC) and the access co-occurrence (SimA) using the following formula.



\[
R_{CA} = \frac{\sum_{i=1}^{m} (SimC_i - \overline{SimC})(SimA_i - \overline{SimA})}{\sqrt{\sum_{i=1}^{m} (SimC_i - \overline{SimC})^2}\; \sqrt{\sum_{i=1}^{m} (SimA_i - \overline{SimA})^2}}
\]

where

\[
\overline{SimC} = \frac{1}{m} \sum_{i=1}^{m} SimC_i, \qquad \overline{SimA} = \frac{1}{m} \sum_{i=1}^{m} SimA_i
\]

and where $SimC_i$ is the value of conceptual relevance for the ith pair of pages, $SimA_i$ is the value of access co-occurrence for the ith pair of pages, and m is the number of page pairs.

The correlation coefficient (RCA) measures the degree of linear relationship between two variables (SimC and SimA). If there is an exact linear relationship, it is 1 or -1 depending on whether the variables are positively or negatively related; if there is no relationship, it is 0 (see [6] for more detail). Thus, if index pages (which generate noisy data) are properly discarded, the correlation coefficient will tend toward 1. Figure 1 shows the result of this computation for the Fuji Xerox Web site. The peak is observed at N = 8, i.e., pages with more than eight references can be considered index pages in this Web site.
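A sketch of this threshold search, assuming each candidate page pair is stored as (SimC, SimA, reference count of page i, reference count of page j); the tuple layout, the toy data, and the names are ours. numpy's corrcoef computes the Pearson coefficient defined above.

```python
import numpy as np

def sweep_threshold(pairs, max_n=30):
    """For each candidate N, discard pairs involving a page with more
    than N references (treated as index pages) and report RCA."""
    for n in range(1, max_n + 1):
        kept = [(c, a) for c, a, refs_i, refs_j in pairs
                if refs_i <= n and refs_j <= n]
        if len(kept) < 2:
            continue
        sim_c, sim_a = zip(*kept)
        r_ca = np.corrcoef(sim_c, sim_a)[0, 1]
        print(f"N = {n:2d}  RCA = {r_ca:.3f}")

pairs = [(0.9, 0.7, 3, 5), (0.2, 0.1, 4, 2), (0.8, 0.9, 6, 3),
         (0.1, 0.2, 2, 9), (0.5, 0.4, 12, 3)]
sweep_threshold(pairs, max_n=12)
```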


Fig 1. Index page determination

The correlation coefficient above can also be used as a criterion indicating the overall design quality of the Web site: it would tend toward 1 if the overall site design were ideal. However, it does not indicate where the Web site requires improvement. For this purpose, we plot the inter-page conceptual relevance versus the access co-occurrence for each page pair, as shown in figure 2. The straight line in the figure is a least-squares regression fit. The markers on the lower right show the page pairs that rarely co-occur in visits even though they are conceptually related. Web site designers can locate the URLs of these pages by pointing at the markers.

Depending on the size and quality of the Web site, our technique may find many page pairs that should be improved. To help designers browse them, it also gives a structural view: it first transforms the set of page pairs into a set of distinct pages, then applies a content-based agglomerative hierarchical clustering algorithm [13] to this set. The resulting page clusters help Web site designers understand the design problem at a more abstract level. For example, we found that many pages in the lower right area of figure 2 were about products of the company.
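A sketch of this clustering step, with scipy's average-linkage routine standing in for the agglomerative algorithm of [13]; the toy three-page term matrix is ours.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Rows: content-word frequency vectors of the distinct pages drawn
# from the flagged pairs (toy values; real ones come from Section 2.1).
pages = np.array([[2.0, 0.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 3.0, 0.0]])
distances = pdist(pages, metric="cosine")    # content-based distances
tree = linkage(distances, method="average")  # agglomerative merging
print(fcluster(tree, t=0.5, criterion="distance"))  # cluster labels
```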


Fig 2. Conceptual relevance versus access co-occurrence

3. Web Site Design Improvement

There are two levels of assistance for Web site design improvement: hyperlink topological improvement and page layout improvement. For example, 67 of the 74 page pairs in the lower right rectangular region (conceptual relevance > 0.6; access co-occurrence < 0.2) in figure 2 are not connected, and they require hyperlink topological improvement. The remaining pairs are connected, and they require page layout improvement for effective user navigation.

3.1. Hyperlink Topological Improvement

If page pairs are not connected, creating a new hyperlink between them or shortening the path between them is an intuitive solution. To provide empirical validation for this solution, we investigate the Web site mentioned above. For each pair of pages, we compute the path length (the number of hyperlinks required to reach one page from the other). Obviously, there can be multiple paths between any two pages in a hypertext system. However, since considering all paths would be computationally expensive, we compute only the shortest path, taking account of hyperlink direction. Figure 3 plots the path length versus the access co-occurrence for each page pair. The straight line is a least-squares regression fit, and we can observe the expected negative correlation.
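The shortest-path computation can be done with a breadth-first search over the directed link graph; the following sketch assumes an adjacency-list layout of our own choosing.

```python
from collections import deque

def shortest_path_length(links, src, dst):
    """Directed BFS giving the hyperlink distance from src to dst.
    `links` maps each URL to the URLs it references; returns None
    when dst is unreachable from src."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == dst:
            return dist
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

links = {"/": ["/products", "/about"], "/products": ["/p1"], "/p1": []}
print(shortest_path_length(links, "/", "/p1"))  # 2
```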



Fig 3. Path length versus access co-occurrence
Because the marker density made patterns hard to see in the original scatter plot, we show quantile density contours at 5% intervals: 5% of the markers lie below the lowest contour, 10% below the next, and so forth; 95% of the markers lie below the highest contour.

3.2. Page Layout Improvement

When page pairs are already connected, Web site designers have to consider improving the page layout design. This is a hard task because of two basic problems.

First, the quality of page layout depends on many factors (e.g., topics, objectives, users, size, languages, the use of multimedia techniques, visual/logical consistency with other pages, and so forth). Therefore, Web site designers probably do not know how to improve it most effectively.

Second, even if Web site designers redesign the page layout based on experience (or by trial and error), they still need to evaluate whether or not it is improved in terms of effective user navigation. Designers typically upload redesigned pages to the site, then wait several days to collect a sufficient amount of access log data for evaluation. If the result is not satisfactory, they repeat the process. This may damage the reputation of the site if the redesigned pages are of worse quality than the original ones. Further, frequent changes in design may confuse users who visit the site often.

To mitigate these problems, we built a system that gives quantitative suggestions on redesigning the page layout and enables Web site designers to simulate the effect of redesigning. In advance, the system analyzes the hyperlink traversal frequency and page layout features for each page in the Web site. The former is the probability that users visit P2 given that they have already visited P1, where a hyperlink from P1 to P2 must exist. Given a page pair, the system analyzes the hyperlink traversal frequency in both directions if hyperlinks exist in both directions. Note that this differs from the inter-page access co-occurrence, which can be used for page pairs regardless of hyperlink connection.
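As one plausible operationalization (the paper does not spell out the counting rule), the sketch below estimates the traversal frequency from sessionized logs by counting sessions in which the visit to P1 was immediately followed by P2; the sessionization itself is assumed to happen upstream, and all names are ours.

```python
def traversal_frequency(sessions, p1, p2):
    """Estimate the hyperlink traversal frequency P1 -> P2: the share
    of sessions that visited p1 and next moved to p2. Sessions are
    ordered lists of URLs."""
    visited_p1 = followed = 0
    for session in sessions:
        for cur, nxt in zip(session, session[1:] + [None]):
            if cur == p1:
                visited_p1 += 1
                if nxt == p2:
                    followed += 1
                break  # count each session once
    return followed / visited_p1 if visited_p1 else 0.0

sessions = [["/a", "/b"], ["/a", "/c"], ["/x", "/a", "/b"]]
print(traversal_frequency(sessions, "/a", "/b"))  # 2/3
```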

As regards the page layout features, we have selected the following five. They are all computable, i.e., the system can automatically extract them (a sketch of the extraction follows the list).

1. Number of distinct hyperlinks. When many hyperlinks are embedded in a page, the probability that any individual hyperlink is selected will decrease. Note that because the gap discovery process in Section 2 discards index pages containing more than N hyperlinks, the maximum number of distinct hyperlinks here is N.

2. Position of a hyperlink. It has been shown empirically that the importance of a sentence in a text is related to its ordinal position [9]. By analogy, the selection probability of a hyperlink may be related to its ordinal position. We presume that a hyperlink in an earlier position will have a higher probability of selection.

3. Hyperlink size, measured by the number of characters between the anchor tags. If the anchor text contains many characters, the region it occupies in the Web browser will be larger, so we expect it to have a higher probability of selection.

4. Relevance between anchor text and the referred page. If the anchor text appropriately describes the content of the referred page, user navigation will be more successful. We measure this relevance using the vector space model described in Section 2.

5. Relevance between preceding text and the referred page. If the sentences preceding a hyperlink appropriately describe the content of the referred page, user navigation will be more successful. The region of the preceding text is delimited by an HTML tag.
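The first three features can be read directly off the HTML, as the sketch below illustrates; the two relevance features additionally need the referred page's content-word vector from Section 2, so they are omitted here. The regex-based parsing is a toy stand-in, not the paper's extractor.

```python
import re

def layout_features(html):
    """Per-anchor layout features: total link count, ordinal position,
    and anchor-text length (the hyperlink size feature)."""
    anchors = re.findall(r"<a\s[^>]*>(.*?)</a>", html, flags=re.I | re.S)
    return [{"n_links": len(anchors),
             "position": i + 1,
             "anchor_len": len(text)}
            for i, text in enumerate(anchors)]

html = '<p><a href="/p1">BrainTech 8180a</a> <a href="/p2">Specs</a></p>'
for features in layout_features(html):
    print(features)
```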

Our system employs multiple regression analysis to predict the hyperlink traversal frequency from a linear combination of these page layout features. Since not every feature is guaranteed to be informative for this purpose, the system uses the stepwise forward selection procedure (see [6]) to remove uninformative ones. This procedure adds the page layout features one at a time until no additional feature explains significant variation. Applying the procedure to the Fuji Xerox Web site, the system added all the features except hyperlink size, which was not significant at a probability-to-enter value of 0.25. This means that hyperlink size is not an informative feature, at least for this Web site. In the remainder of this paper, we use the other features for experiments.
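A sketch of the forward selection loop, using ordinary least squares from statsmodels in place of JMP; the 0.25 threshold mirrors the probability-to-enter value above, and the function names and synthetic demo data are ours.

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, names, p_enter=0.25):
    """Repeatedly add the candidate feature with the smallest t-test
    p-value, stopping when none stays below the p_enter threshold."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_p, best_j = 1.0, None
        for j in remaining:
            design = sm.add_constant(X[:, chosen + [j]])
            p_value = sm.OLS(y, design).fit().pvalues[-1]
            if p_value < best_p:
                best_p, best_j = p_value, j
        if best_j is None or best_p >= p_enter:
            break
        chosen.append(best_j)
        remaining.remove(best_j)
    return [names[j] for j in chosen]

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 0.1 + 0.12 * X[:, 2] + 0.08 * X[:, 3] + rng.normal(0, 0.02, 200)
print(forward_selection(X, y, ["n_links", "position", "anchor_rel", "pre_rel"]))
```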

Next, the system obtains the following regression equation.



\[
Y = b_0 + \sum_{i=1}^{p} b_i X_i
\]

where Y is the hyperlink traversal frequency, $X_i$ is the score on the ith page layout feature, $b_0$ is the regression constant, $b_i$ is the weight corresponding to the ith page layout feature, and p is the number of page layout features remaining after the stepwise procedure.

This equation reflects the site-specific characteristics of page layout design in terms of user navigation. In other words, hyperlink traversal frequency can be predicted when these page layout features are known.

Given a pair of pages (from the lower right area in figure 2), the system analyzes the page layout features and substitutes them for the corresponding $X_i$ in the regression equation. If one feature (the kth) is left unsubstituted at this step, the regression equation becomes a linear equation in the single variable $X_k$:

\[
Y = \Bigl(b_0 + \sum_{i \neq k} b_i x_i\Bigr) + b_k X_k
\]

where $x_i$ is the substituted current value of the ith feature.
For each page layout feature, the system obtains the equation above and displays a graph like that shown in figure 4. If x1 is the current value of Xk, then y1 is the predicted value of Y. To increase y1 to y2, the graph suggests that the value of Xk should be changed to x2, provided the other features remain unchanged.
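A sketch of this single-variable what-if computation; the function is ours, and the demo numbers are the coefficients and feature values reported later in Section 4, for which the required X4 comes out near 0.55.

```python
def required_value(b0, coeffs, current, k, target_y):
    """Solve Y = c + b_k * X_k for the X_k reaching target_y, holding
    every other feature at its current value."""
    c = b0 + sum(b * x for j, (b, x) in enumerate(zip(coeffs, current))
                 if j != k)
    if coeffs[k] == 0:
        raise ValueError("feature k has no effect on Y")
    return (target_y - c) / coeffs[k]

# Coefficients and current feature values from the Section 4 case study.
coeffs = [-0.0025, 0.0011, 0.1196, 0.0841]
current = [8, 6, 0.56, 0]
print(required_value(0.1004, coeffs, current, k=3, target_y=0.2))  # ~0.55
```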


Fig 4. Hyperlink traversal frequency (Y) as a function of page layout feature (Xk)

When multiple features are redesigned simultaneously, the system reanalyzes the page layout features and repeats the procedure above. By simulating the effect of redesign activities in this manner, Web site designers need not upload redesigned pages to the live Web site for evaluation. Consequently, they can avoid possible damage to the reputation of the Web site.


4. Case Study

In this section, we describe and discuss some cases that indicate the advantages (and disadvantages) of our approach.

4.1. Case 1: Hyperlink Topological Improvement

We examined page pairs in the lower right area of figure 2. As described above, they are conceptually relevant but rarely co-occur in visits. Figure 5 shows one such pair, whose pages are not connected to each other. Each page explains the technical features of a different product (a multifunction copier/fax/printer). Since the two products are very similar except for printing speed, the contents of the two pages are also similar; in fact, the value of their conceptual relevance was 0.95 (very high). Users interested in purchasing one of the products are probably also interested in the other. However, the value of access co-occurrence between the pages was only 0.13. This can be explained by their hyperlink connectivity: the shortest path between them had length eight, i.e., at least eight clicks were needed to visit both pages. Understanding this situation, Web site designers can create a shortcut between the pages for better user navigation.

This case study shows the advantages of our gap discovery technique described in Section 2. Since the cause of the situation is clear, Web site designers can easily understand what they should do in this case.


Fig 5. Case study 1: Pair of pages that are not connected

4.2. Case 2: Page Layout Improvement

Figure 6 shows one of the page pairs in the lower right area of figure 2. The left page is a general introduction to engineering printing systems, whereas the right is a concrete explanation of one system. They are conceptually related, and the value of their conceptual relevance was as high as 0.65. On the other hand, the value of access co-occurrence was 0.17. Unlike the previous case, there is a hyperlink connection from left to right. We therefore presume that few users who visited the left page followed the hyperlink (labeled BrainTech 8180a).


Fig 6. Case study 2: Pair of pages that are connected

To investigate this situation, our system uses the multiple regression equation to predict the hyperlink traversal frequency as described in Section 3. The equation obtained for this Web site was as follows.

Y = 0.1004 - 0.0025*X1 + 0.0011*X2 + 0.1196*X3 + 0.0841*X4

where

Y: hyperlink traversal frequency

X1: the number of distinct hyperlinks

X2: the position of a hyperlink

X3: conceptual relevance between anchor text and the referred page

X4: conceptual relevance between text preceding a hyperlink and the referred page

Note that X2 has a positive coefficient, contrary to our earlier expectation that a hyperlink in an earlier position would have a higher probability of selection. This can be interpreted to mean that users of this Web site tend to scroll through the whole page and then decide which hyperlink to follow. This interpretation is also supported by a recent study indicating that users spend a great deal of time scrolling [4].

Given the pair of pages in figure 6, the system extracted the value of each feature as follows.

X1 = 8 (the number of hyperlinks in the left page)

X2 = 6 (the ordinal position of the hyperlink labeled BrainTech 8180a)

X3 = 0.56 (the conceptual relevance between the anchor text (i.e., BrainTech 8180a) and the right page)

X4 = 0 (the value is 0 because there is no text preceding the hyperlink)

By substituting these values for the variables in the equation above, the system obtained Y = 0.154 (the predicted value of hyperlink traversal frequency). This value means that 15.4% of users who visit the left page are predicted to follow the hyperlink.
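As a quick check, the predicted value follows directly from the regression equation (a two-line verification, with our own variable names):

```python
b = [0.1004, -0.0025, 0.0011, 0.1196, 0.0841]  # b0, b1..b4 from above
x = [1, 8, 6, 0.56, 0]                         # constant term, X1..X4
print(round(sum(bi * xi for bi, xi in zip(b, x)), 3))  # 0.154
```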

Further, the system gives the regression equation between Y and Xk (k =1,2,3,4) as described in Section 3. Figure 7 shows the equation for each feature as a quantitative suggestion. Web site designers can now understand what they should do to increase the value of Y. If they remove several hyperlinks from the page (reduce the value of X1), Y will increase. However, if Y = 0.2 is required, this redesign is still insufficient. Designers need to consider another feature (or multiple features) for that purpose. If, for example, the value of X4 is increased from 0 to 0.55 by adding a paragraph that describes the system (BrainTech 8180a), the purpose will be achieved. When we tried copying the first paragraph in the right page (in figure 6) to the space before the hyperlink in the left page, the system reanalyzed the value of X4 to return X4 = 0.48 and Y = 0.194. If Web site designers are still not satisfied with this value of Y, they can consider further redesign with a new quantitative suggestion based on reanalysis.

This case study shows the advantages of our system for page layout improvement as well as gap discovery. By using our system, Web site designers can understand what they should do based on quantitative suggestions. Moreover, they can simulate their redesign activities without involving real users, i.e., they need not collect access log data for each redesign step for the purpose of evaluation.





Fig 7. Quantitative suggestion for each feature
The dotted line indicates the value of each feature (Xk) and the predicted value of hyperlink traversal frequency (Y: 0.154)

4.3. Case 3: Higher Access Co-occurrence Despite Less Conceptual Relevance

In this section, we investigate page pairs in the upper left area of figure 2. Unlike those in the lower right area, these pairs are conceptually less related but frequently co-occur in visits. In these cases, we assume that some users navigate the Web site with a goal that cannot be characterized by page content analysis based on the vector space model.

Figure 8 shows one of the page pairs in the upper left area of figure 2. Each page shows the location of a different office. Presumably, some users were interested in the offices of Fuji Xerox and visited both pages. However, the value of their conceptual relevance was only 0.16. This is because the vector space model depends only on the frequency of surface words: the pages contain office addresses, but no words that explicitly express the meaning of location. Thus, the value did not reflect the abstract notion of location.


Fig 8. Case study 3: Pair of pages that are considered conceptually less relevant

This case study shows a disadvantage of our system, which employs the vector space model to measure inter-page conceptual relevance. Solving this problem would require semantic analysis that explores contents at a deeper level; however, this approach is unrealistic given current natural language understanding technology.

A more realistic solution is the use of metadata that describes page contents. If Web site designers embed machine-readable metadata in each page, we can use this data to measure conceptual relevance. Because the W3C has introduced the Resource Description Framework (see http://www.w3.org/RDF/) as a worldwide standard, the use of metadata is being encouraged.


5. Conclusions and Future Work

We have presented a technique that discovers the gap between Web site designers' expectations and users' behavior. The former are assessed by measuring inter-page conceptual relevance, the latter by measuring inter-page access co-occurrence. Plotting both on the same graph reveals the gap, and we have shown that removing index pages, which can be identified by their number of outgoing hyperlinks, prevents the generation of noisy data.

We have also presented a system that gives quantitative suggestions for page layout improvement. The system uses multiple regression analysis to predict hyperlink traversal frequency from selected page layout features. Web site designers can simulate the redesigning process using the system without involving real users. We have validated the effectiveness of our approach using case studies.

As future work, we suggest examining the use of metadata for measuring inter-page conceptual relevance. If Web site designers use metadata to explicitly describe page contents, the measurement of inter-page conceptual relevance will more accurately reflect their expectations.

We also suggest investigating other page layout features for predicting hyperlink traversal frequency; color and multimedia usage are possible candidates. However, because these features are measured on a nominal rather than a continuous or ordinal scale, multiple regression analysis is not directly applicable. Thus, a technique that transforms the data so that the transformed variables have good distributional properties will be necessary.


Acknowledgements

This study was conducted as part of the Fuji Xerox Document Mining research project led by Yoshihiro Ueda, who provided us with essential support.


References

[1] R. Agrawal, T. Imielinski, and A. Swami, Mining Associations between Sets of Items in Massive Databases, in: Proc. the ACM SIGMOD International Conference on Management of Data, 1993.

[2] J. Borges and M. Levene, Mining Association Rules in Hypertext Databases, in: Proc. the 4th International Conference on Knowledge Discovery and Data Mining, 1998.

[3] R. A. Botafogo, Cluster Analysis for Hypertext Systems, in: Proc. the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993.

[4] M. D. Byrne, B. E. John, N. S. Wehrle, and D. C. Crow, The Tangled Web We Wove: A Taskonomy of WWW Use, in: Proc. ACM CHI Conference: Human Factors in Computing Systems, 1999.

[5] M. S. Chen, J. S. Park, and P. S. Yu, Data Mining for Path Traversal Patterns in a Web Environment, in: Proc. the 16th IEEE International Conference on Distributed Computing Systems, 1996.

[6] C. J. Huberty, Applied Discriminant Analysis, John Wiley & Sons, Inc., 1994.

[7] W. S. Li, Q. Vu, D. Agrawal, Y. Hara, and H. Takano, PowerBookmarks: A System for Personalizable Web Information Organization, Sharing, and Management, in: Proc. the 8th International World Wide Web Conference, 1999.

[8] H. Lieberman, Letizia: An Agent That Assists Web Browsing, in: Proc. the International Joint Conference on Artificial Intelligence, 1995.

[9] C. Y. Lin and E. Hovy, Identifying Topics by Position, in: Proc. the 5th ACL Conference on Applied Natural Language Processing, 1997.

[10] M. Perkowitz and O. Etzioni, Adaptive Web Sites: Automatically Synthesizing Web Pages, in: Proc. the 15th National Conference on Artificial Intelligence, 1998.

[11] G. Salton, Developments in Automatic Text Retrieval, Science, Vol.253, 1991.

[12] M. Spiliopoulou, C. Pohle, and L. C. Faulstich, Improving the Effectiveness of a Web Site with Web Usage Mining, in: Proc. the Workshop on Web Usage Analysis and User Profiling, 1999.

[13] E. M. Voorhees, Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval, Information Processing & Management, Vol. 22, 1986.


Vitae

Takehiro Nakayama is a member of the research staff at Fuji Xerox Corporate Research Laboratories. Prior to that, from 1992 to 1996, he was a member of the research staff at FX Palo Alto Laboratory. He received his M.S. in Information Science from Kyushu University in 1989. His interests include Web site analysis, information retrieval, natural language processing, and document image understanding. 

Hiroki Kato is a member of the research staff at Fuji Xerox Corporate Research Laboratories. He received his M.S. from Tokyo Institute of Technology in 1997. His research interests lie in Artificial Intelligence, especially in Machine Learning. His current research includes Web mining and personal assistant agent for the Web.

Yohei Yamane is a member of the research staff at Fuji Xerox Corporate Research Laboratories. He received his M.S. from Nara Institute of Science and Technology (NAIST), where he studied natural language processing, in particular anaphora resolution and centering theory. Presently, he works on Web site analysis.