
System for Reminding a User of Information Obtained through a Web Browsing Experience


Tetsushi Morita
NTT Corporation, NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho, Musashino-shi, Tokyo, 180-8585 Japan
81-422-59-4840
morita.t@lab.ntt.co.jp

Tetsuo Hidaka
NTT Corporation, NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho, Musashino-shi, Tokyo, 180-8585 Japan
81-422-59-7150
hidaka.tetsuo@lab.ntt.co.jp

Akimichi Tanaka
NTT Corporation, NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho, Musashino-shi, Tokyo, 180-8585 Japan
81-422-59-4483
tanaka.akimichi@lab.ntt.co.jp

Yasuhisa Kato
NTT Corporation, NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho, Musashino-shi, Tokyo, 180-8585 Japan
81-422-59-4420
kato.yasuhisa@lab.ntt.co.jp


ABSTRACT

We propose a system for reminding a user of information obtained through a web browsing experience. The system extracts keywords from the content of the web page currently being viewed and retrieves the contexts of past web browsing related to those keywords. We define a context as a sequence of web browsing in which many web pages related to the keywords were viewed intensively, because we assume that much of the information connected to the current content was obtained in such a sequence. This information includes not only which pages the user viewed but also how the user found those pages and what knowledge the user acquired from them. Specifically, while the user browses web pages, the system automatically displays a list of the contexts judged to be important in relation to the current web page. When the user selects a context, its details are shown graphically with marks indicating characteristic activities.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - retrieval models, search process

General Terms: Algorithms, Management, Design.

Keywords: Context, Information Retrieval, User's Behavior, History.

1.     INTRODUCTION

 

Copyright is held by the author/owner(s).

WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.

ACM 978-1-59593-654-7/07/0005.

Have you ever been frustrated at failing to rediscover useful web pages that you viewed in the past? We forget, and thus waste, much of the information that we obtain through our own web browsing. Most of us have retrieved the same web page more than once. One report puts the rate at which people revisit previously seen web pages, among all pages they view, at 81% [3]. The information that we obtain through an experience such as web browsing is not limited to the content of web pages. We also seem to remember which web pages we viewed in a session, how the pages were found, and what knowledge we acquired from them. We define this information as "obtained information". In this paper, we describe a system that aims to remind a user of previously obtained information efficiently. Popular tools for retrieving previously seen web pages include the bookmark function of a web browser and desktop search systems based on keyword matching [2]. Several semantic desktop search systems have also been proposed. One of them helps a user to retrieve web pages and e-mails based on closely associated information, by adding metadata such as the URL of the web page that was visited subsequently and the destination address of an e-mail [1]. A previous version of our method helps users to retrieve web pages viewed in the past by calculating their personal importance from log data collected on personal computers [4]. These desktop search systems help a user to find a web page viewed in the past efficiently. However, they do not seem well suited to reminding us of the various kinds of information that we acquired together in a past web browsing experience, because we must choose and visit many independently retrieved web pages.

2.     PROPOSED SYSTEM

We focus on contexts in the past. A context is a sequence of web browsing in which many web pages related to the content of the web page currently being viewed were viewed intensively and many actions were performed. We call the time span of this sequence an "intensive period". We assume that much obtained information is contained in the context. For example, if a user is researching a product, he or she first finds the context related to the current web pages and then recovers much obtained information, such as the URLs of multiple web pages that were visited at that time. By chronologically tracing the activities in the context, such as which web pages were viewed, the user learns how the pages were found and what knowledge was gained from them.

2.1     Collecting action logs

It is difficult to force a user to perform the actions required to create history data, such as recording when and how he or she viewed a web page. Therefore, a logging module collects this data instead: information about the computer's mouse, keyboard, copy, and print events; window conditions; the URLs of visited pages; source files; thumbnails; HTTP headers; text selected by the user; and so on. The module has an encryption function to protect the user's privacy.
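As an illustration of the kind of record such a logging module might keep, the following is a minimal sketch. The `ActionEvent` type, its field names, the category labels, and the `count_actions` helper are all hypothetical and do not come from the described system; encryption and the capture of thumbnails, source files, and HTTP headers are omitted.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ActionEvent:
    """One logged user action (hypothetical schema, for illustration only)."""
    timestamp: datetime
    category: str   # assumed labels, e.g. "click", "key", "copy", "print", "select"
    url: str        # URL of the page shown in the active window
    detail: str = ""  # e.g. the text that was selected or copied


def count_actions(events, url):
    """Count how often each action category occurred on one page."""
    return Counter(e.category for e in events if e.url == url)


events = [
    ActionEvent(datetime(2007, 5, 8, 10, 0), "click", "http://example.com/a"),
    ActionEvent(datetime(2007, 5, 8, 10, 1), "copy", "http://example.com/a", "some text"),
    ActionEvent(datetime(2007, 5, 8, 10, 2), "click", "http://example.com/b"),
]
counts = count_actions(events, "http://example.com/a")
```

Per-category counts of this kind are the sort of input that an importance calculation over logged actions would need.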

2.2     Extracting keywords of current web page

Our system analyzes the content of the current web page to extract keywords that represent the page. The technique is simple. An analyzer in a browser component obtains the current content $C$ and characterizes it by extracting the most frequent terms. A score $S_i$ is then determined for each term $t_i \in C$, where $S_i = (1 + w_i) R(t_i)$ and $w_i$ is a weighting coefficient that varies heuristically with the locality of $t_i$. That is, extracted terms that occur in anchor text are assigned a higher weight than those in <body> text, but a lower one than those in <title> or <h1>. $R(t_i)$ is a variant of the TF-IDF measure, in which the term frequency of $t_i$ is multiplied by the inverse document frequency of $t_i$ to approximate each term's importance. Next, after stemming and other adjustments are applied, the top $N$ ranked terms and their weights are regarded as a set of representative keywords of the current web page. We call this set of keywords the "current keywords".
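The scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the locality weight values, the smoothed IDF formula, the tokenizer, and the names `LOCALITY_WEIGHT`, `tokenize`, and `current_keywords` are all hypothetical, and stemming is omitted. Only the ordering of the weights (title and h1 above anchor text, anchor text above body text) follows the description above.

```python
import math
import re
from collections import Counter

# Assumed locality weights; only their ordering follows the text.
LOCALITY_WEIGHT = {"title": 1.0, "h1": 1.0, "a": 0.5, "body": 0.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def current_keywords(page, corpus, top_n=5):
    """Return the top_n terms of `page` ranked by (1 + weight) * TF * IDF.

    page:   dict mapping a locality ("title", "h1", "a", "body") to its text
    corpus: list of term sets, one per previously logged page
    """
    tf = Counter()
    weight = {}
    for locality, text in page.items():
        w = LOCALITY_WEIGHT.get(locality, 0.0)
        for term in tokenize(text):
            tf[term] += 1
            # A term keeps the highest weight among the places it occurs.
            weight[term] = max(weight.get(term, 0.0), w)

    n_docs = len(corpus) + 1  # logged pages plus the current page
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) + 1.0  # smoothed IDF (an assumption)
        scores[term] = (1.0 + weight[term]) * freq * idf

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


page = {"title": "camera review", "a": "lens", "body": "camera lens price price"}
corpus = [{"price", "shop"}, {"news"}]
keywords = current_keywords(page, corpus, top_n=2)
```

In this example "price" is as frequent as "camera" and "lens" but ranks lower, because it appears only in body text and also occurs in a previously logged page.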

2.3     Extracting past contexts

We extract an intensive period through the following steps. First, the degree of importance $I$ of an arbitrary time $t$ with respect to the current keywords $k$ is calculated by eq. (1). We focus on an active period $ap$, during which a web page is actively shown in a window. $E_i$ is the weighting factor of action category $i$, such as the amount of active time, copying, printing, mouse clicking, keyboard input, and text selection. $Fr_i$ is the number of occurrences of $i$ in $ap$. $R$ is the relevance ratio of the web page in $ap$ to $k$, given as a TF-IDF value computed over the set of all web pages logged by the logging module. Then, the average degree of importance $AI$ of $t$ with respect to $k$ is determined from eq. (2), which uses an averaging parameter. If the average degree does not exceed a threshold parameter, the operation related to the current keywords is regarded as being discontinuous at time $t$ (Fig. 1). The values of both parameters are decided heuristically. In this way, the method can regard an intensive period $ip$ as a past