Knowledge Encapsulation for Focused Search from Pervasive Devices

Knowledge Encapsulation for Focused Search from Pervasive Devices

Yariv Aridor1, David Carmel1, Ronny Lempel 2, Yoelle S. Maarek1, Aya Soffer1
1IBM Research Lab in Haifa
MATAM, Haifa 31905, ISRAEL
contact email: carmel@il.ibm.com
2Dept of Computer Science
Technion, Haifa 32000, ISRAEL

 

Copyright is held by the author/owner(s).

WWW10, May 1-5, 2001, Hong Kong.

ACM 1-58113-348-0/01/0005.

Abstract

Mobile knowledge seekers often need access to information on the Web during a meeting or on the road, while away from their desktop. A common practice today is to use pervasive devices such as Personal Digital Assistants or mobiles phones. However, these devices have inherent constraints (e.g., slow communication, form factor) which often make information discovery tasks impractical.
In this paper, we present a new focused-search approach specifically oriented for the mode of work and the constraints dictated by pervasive devices. It combines focused search within specific topics with encapsulation of topic-specific information in a persistent repository. One key characteritic of these persistent repositories is that their footprint is small enough to fit on local devices, and yet they are rich enough to support many information discovery tasks even in disconnected mode. More specifically, we suggest a representation for topic-specific information based on "knowledge-agent bases" that comprise all the information necessary to access information about a topic (under the form of key concepts and key Web pages) and assist in the full search process from query formulation assistance to result scanning on the device itself. The key contribution of our work is the coupling of focused search with encapsulated knowledge representation making information discovery from pervasive devices practical as well as efficient. We describe our model in detail and demonstrate its aspects through sample scenarios.

 

Keywords: Focused search; Pervasive devices; Disconnected search; Knowledge agents.

1. Introduction

Pervasive devices, such as Personal Digital Assistants (PDAs) and mobile phones, currently provide much more functionality than they were originally designed for. PDAs are no longer simple organizer/calendar tools. Hundreds of applications are available on the PalmOS platform alone (from ebooks to Web browsing), both in connected and disconnected modes (see palm.net webclipping applications [19]). This trend can be observed in the mobile phone business as well, with WAP phones offering a wide variety of services ranging from flight information to remote banking and e-trading.

As these devices developed additional functionality, they have evolved into general purpose information appliances in the form of very thin clients. Information access requires adequate information discovery abilities, that is to say proper mechanisms for searching and browsing information. The nature of pervasive devices, however, imposes new constraints on the classic search and browse paradigms that reign in the connected desktop world. These constraints are mostly due to:

  1. The form factor of pervasive devices:
    Entering information is extremely cumbersome on pervasive devices. Whether the input device is a stylus supporting graffiti or handwriting recognition, a "soft keyboard" (i.e., a virtual keyboard shown on the display), or even worse a telephone keypad, the rate and accuracy of the input are quite limited. Moreover, pervasive devices are limited in terms of output due to their small screen size.
  2. The communication mode of pervasive devices:
    Communication between pervasive devices and external resources varies from sporadic synchronization with a personal computer (via a cradle or modem) to full connectivity via external wireless modems, built-in connectivity (Palm VII [19]), or RF (RIM handheld devices [21]). The communication mode directly influences the timing of discovery tasks. In the first case, most discovery tasks need to be performed in batch (during synchronization), while in the second case we may wish to minimize the amount of information that is transferred over the wire in order to improve response time, cut costs, conserve battery consumption, reduce radiation exposure and more.

The form-factor-related constraints have a direct impact on the success of any PDA application. Indeed if the mode of interaction is awkward or too time consuming, the application will most likely not be adopted by users. In the context of WWW information discovery, two types of interaction are critical: navigation (for browsing and exploring) and text input (for searching). Many user studies have been conducted in order to evaluate various methods for information navigation and for text input given various form factor constraints. As discussed in [5], many browsing paradigms that we take for granted in the desktop world have to be reconsidered in the context of small devices. For instance, scrollbars that are widely used on desktops are much harder to master on small devices and thus the simple act of scrolling itself often becomes disorienting. Consequently, content is often pre-packaged in device-dependent formats which simplify the navigation task. Examples include WML decks for WAP phones [27], and HDML pages for Palm VII. Ideally, content should be automatically or semi-automatically translated from a common format (e.g., XML) into device-specific formats, and in fact many transcoding products [22] tackle exactly this problem. However, this ideal solution is not realizable at this point since content is currently not provided in one common format, and since transcoding has turned out to be an elusive task.

Since an all-in-one solution does not seem near, it is necessary to devise interim application-specific and device-specific solutions. It is suggested in [11], for example, that navigation can be greatly enhanced by the definition of short-cut keys that facilitate faster browsing through WML decks in the context of a business card application on WAP phones. A promising direction has also been explored in the Power Browser system [5] that reorganizes the content of individual Web sites so as to decouple the navigation and viewing phases. This method was designed for local site search.

In terms of text input, several approaches have been proposed to enhance the user's experience for various input devices. Tegic T9 text input technology [24] provides a disambiguation mechanism which enables text input on phone keypads with much fewer keystrokes than is usually necessary. Masui [16] provides both word-level completion (via menus and dynamic approximate string matching) and phrase-level completion. The Power Browser [5] supports site-specific word completion.

The second class of constraints we consider pertains to the communication mode of pervasive devices, especially in terms of bandwidth limitations and response time requirements. Even if the interaction process is greatly enhanced by the techniques described above, a typical search process still requires numerous transactions between the wireless device and the content-storing servers (which include both search engines and servers that hold the actual target data). In [5] for instance, where users can search a Web site via a wireless PDA, at least six interactions are necessary before PDA formatted search results are displayed. Even with reasonable bandwidth, each such transaction is noticeable by the user who will consequently wait a few seconds for meaningful results.

A common solution geared to improve response time in many computer applications is to cache some information where it is required and thus provide quasi immediate feedback without having to access the original storage location. In the context of pervasive devices, we would like to store/cache some information on the local device, thus eliminating the need to access remote servers during some stages of the information discovery task. Avantgo [1] and IntelliSync [10] enable caching a pre-selected set of Web pages on pervasive devices. However, caching data for general purpose Web information discovery tasks would require an insurmountable amount of local storage space.

Focused Search

The analysis of the constraints inherent to pervasive devices such as PDAs presented above, leads us to believe that a promising direction for making search practical on pervasive devices is using a focused topic-specific search model, that works mostly in disconnected mode, with periodic downloading of data to the device. Focused search addresses the above issues as follows:

  • The complexity of navigation on pervasive devices can be further reduced if, rather than focusing on Web sites as in [5], we focus the browsing on specific topics. In particular, topic discovery and browsing is possible by revealing various aspects of the topic-specific knowledge to the user such as top phrases and top Web pages. A topic-specific approach also seems more practical as most discovery tasks involve compiling information from various resources on a specific topic, rather than exploring an individual Web site.
  • Working at the topic level allows us to further enhance input completion techniques. We can provide accurate and useful topic-specific query formulation assistance, which includes word completion as well as phrase completion and expansion. Note that topic-specific phrase completion not only saves several keystrokes, but also improves the precision of the query results by disambiguating the query in the context of the topic at hand. The role of query refinement as a method to improve search results has long been recognized [25,30], and topic specificity can improve this aspect of query refinement as well.
  • Addressing the bandwidth limitations of pervasive devices and the response time constraints which applications must meet, we note that limiting ourselves to topic-specific discovery tasks greatly reduces the required amount of cached data, and thus caching based techniques become feasible.

Focused search and crawling has been proposed in the past, mostly with the motivation of assisting novice Web surfers and/or improving the quality (precision) of results. Some systems and services predefine the topics of interest (a.k.a. domains) themselves (see for instance various directory services such as Yahoo [29] and Google [8]), while others allow the users to define their topics of interest (e.g., Fetuccino [9,3], Focused Crawler [7], WTMS [18], and knowledge agents [2]).

Since the amount of memory available on pervasive devices is very limited, it is crucial that the information stored locally be highly representative of the topic, yet concise. While the focused search techniques mentioned above can be used to collect a representative collection of Web pages for a particular topic, they are insufficient in terms of encapsulating the knowledge representation required to support all of the information discovery tasks on the pervasive device.

Our approach

In this paper, we describe a focused-search approach specifically designed for the mode of work and the constraints dictated by pervasive devices. Topics are represented and stored in a persistent repository with a very small footprint which can be downloaded to small devices. More specifically, we suggest representing topic-specific information via "knowledge-agent bases" (KABs) that encapsulate information required to assist users in their discovery process from pervasive devices. The idea is to allow users to predefine a topic of interest, and then capture a very small but representative piece of the Web for this topic, storing characteristic information regarding it on the PDA. The information we store includes a topic-specific lexicon, a small number (~100) of the most authoritative Web pages and hubs of the domain (core pages) whose text is stored in its entirety, and a relatively larger number (on the order of thousands) of second-tier Web pages (outlinks of the core pages) for which we store only the URL and anchor text of the outlink. An index which supports search on the text of the core pages as well as on the anchor text of the referenced pages is also stored on the device.

The KAB's are built automatically using the Knowledge Agent technology that we have described in [2]. Users need only to supply several (4-6) queries which define the topic, along with an optional set of seed URLs which they consider to represent the topic at hand. We elaborate on the knowledge acquisition process in Section 2.1.

Combining focused search with encapsulated knowledge representation holds the promise of making information discovery from pervasive devices practical as well as efficient. The advantages of this approach are numerous:

  1. Response time for many initial discovery tasks is negligible, as they can be performed locally on the device, in fully disconnected mode. Examples include such tasks as query refinement, URL browsing, and searches within a representative set of Web pages. These can all be performed instantaneously on the PDA with no need for communication with a remote server.
  2. Enhanced query input is supported via topic-specific word and phrase completion.
  3. KAB's can be easily exchanged with colleagues since their footprint is rather small.
  4. The domains are user defined and can be of any granularity as they are created via several seed URLs and a few sample queries.
  5. Topic discovery and browsing is supported by revealing various aspects of the knowledge base such as the top phrases in the lexicon and the list of core pages.
  6. Performing tasks in a disconnected mode is in itself a useful feature to users who are unable to connect to a server since they are in remote location not well covered by their wireless service, or in the middle of a meeting, or whether they wish to reduce battery consumption and radiation exposure.

The rest of this paper is organized as follows. Section 2 presents the topic specific knowledge representation and acquisition methods we use. Section 3 describes the process of information discovery in disconnected mode including topic browsing, query formulation, and disconnected search. Section 4 consists of some sample scenarios and examples using knowledge encapsulation for focused search on a Palm device. Section 5 concludes by reiterating the key contributions of this work and pointing to future directions. Related work is discussed as relevant throughout the paper.

2. Topic Specific Knowledge Representation and Acquisition

Knowledge agents is a methodology for focused information discovery that the authors of this paper introduced in [2]. The goal of the knowledge agent approach is to gain persistent knowledge on a given topic in order to assist users in their information discovery tasks. Knowledge agents are created by users, and undergo a training phase during which they act on behalf of their creator in acquiring automatically expertise on a given topic. This knowledge is subsequently leveraged to facilitate both the search task (query formulation assistance, disconnected search, result ranking) and the browsing (by providing a set of pertinent pages and key concepts) on matters pertaining to the agent's domain of expertise.

The motivation for using a knowledge agent approach in our context is its compact representation, or encapsulation, of topic-specific knowledge. The compactness of the representation allows its deployment in the limited resource environment of pervasive devices, while the topic specificity of the represented knowledge answers the form-factor and communication model constraints of these devices, as explained in the Introduction.

The key adaptation of the knowledge agent approach to pervasive information discovery is the physical detachment between the knowledge acquisition phase and the knowledge application phase. Knowledge acquisition is done on a desktop server, with continuous (and preferably broadband) Internet access. After the topic-specific knowledge is acquired (and represented in a KAB), it is downloaded to the pervasive (and often disconnected) device, and serves to conduct as many information discovery tasks as possible either in fully disconnected mode or with minimal access to remote servers.

Figure 1 captures the server/device interaction. The knowledge acquisition phase can be initiated either from a connected desktop or from a PDA connected through a cradle or a wireless modem. Once the KAB is ready, it can be downloaded to the device again through a cradle or wireless modem. The user will only need to reconnect to the server to obtain updates or create new agents. Note that small KABs can also be obtained directly from other users, in a peer-to-peer communication exchange through direct beaming (a small KAB of about 400K takes about two minutes to exchange). Larger KABs can still be exchanged between users (an average KAB of about 1MB can take a few minutes). However, since such a long beaming operation might be inconvenient (not to mention battery consumption problems), it is preferable in the case of larger KABs to beam only a reference to the KAB on the server, and let the receiving party download it directly from the server.

archi.gif (14801 bytes)
Figure 1 - Server/Device Interaction

This section focuses on the knowledge representation and acquisition phases. Section 3 will detail how this knowledge is applied to information discovery on pervasive devices, and specifically how pervasive search on PDAs is made both effective and efficient.

2.1 Knowledge Representation

In order to design a topic-specific, Web oriented knowledge acquisition process, we must first answer the following questions:
  1. What is a topic on the Web?
  2. How will the creator define the topic in question to the Knowledge Agent?
  3. What should the agent learn about the topic, in order to assist future users' information retrieval tasks?

We answer the first question in the broadest manner possible, by defining a topic as simply a set of thematically related URLs. This definition poses no restrictions on the granularity of topics, and indeed users may create agents to tackle topics of any breadth. With this definition, two outright manners for defining a specific topic t to an agent come to mind:
  1. Specifying some members of the set of URLs: The creator may explicitly enter several URLs which pertain to the topic t of the agent. While this form of input is allowed at any point in the training phase of the agent, the typical scenario is for these URLs to be given at the time of the agent's initialization. The URLs themselves should be deemed by the creator as being valuable Web resources on t, and may originate, for example, from the creator's personal bookmarks files.
  2. Specifying the thematic relationship between the pages of the set. This is done by providing the agent with a small set of (usually 4-6) queries which pertain to the topic in question. Essentially, most of the pages in the set should be relevant to one or more of the queries.

Having answered the first two questions, we move on to tackle the type of knowledge which should be acquired by the agent. We claim that the agent can limit itself to acquiring the following information:

  • A topic-specific lexicon, containing the dominant terms and lexical affinities of the domain (where lexical affinities are terms sharing a lexical relation and are identified as pairs of terms frequently co-occurring in close proximity [15]). As will be explained in Section 3, this lexicon is used both for query formulation assistance and as the source for corpus statistics that are required for the disconnected search process.
  • In-depth knowledge (the full text) of the top-notch Web resources on the topic. These resources, which will be termed the core set, include both authoritative pages on the topic (authorities) and hubs (pages which are themselves resource lists on the topic). See [12] for more precise definitions of hubs and authorities. Many typical queries on the topic will have their answers found in the core set of pages, and by indexing their full text, the agent will be able to satisfy these queries in disconnected mode.
  • Some knowledge of the second-tier of Web resources on the topic. Obviously, in order to keep the footprint of the KAB small, the size of the core set is quite limited and many valuable pages related to the topic will have been left outside of this set. However, while we may not be able to store the entire contents of these second-tier resources, it would be helpful to have some idea regarding their content, and to have the ability to suggest to users (in response to certain queries) that these URLs might be worth browsing at the next available moment of Internet connection.

Section 3 elaborates on how this knowledge base assists topic-specific disconnected information retrieval. Before moving on to detail the knowledge acquisition process, however, the key observation to be made is that the above knowledge is all implied by the core set of pages (and their associated text). Since the core pages are of the best available resources on the topic, their text should be representative of the domain language. It should contain topic-specific terminology, lexical affinities, and phrases. As for the required knowledge of second-tier pages, exactly such knowledge is contained in the hyperlinks which emanate from core pages. These links are associated with anchor text, which is often highly descriptive of the destination URL's contents. This is especially true for high quality pages, where the anchor text, like most other text, is highly informative. Moreover, since we aim to have top-notch hubs in the core set, their out-going links may be quite numerous and point to pages of high quality.

We call the URLs which are referenced by the core set satellite pages, and our experience shows that for core sets which contain about 100 pages, some 3000 satellite pages are referenced, with descriptive anchor text existing for many of these pages. Thus, the agent attains some knowledge on the contents of a much wider set of pages than it is actually able to store.

2.2 Knowledge Acquisition

Knowledge is acquired by collecting the core pages in an iterative process, governed by queries. Recall that the agent's creator defines the topic by a series of queries, which we will now denote by q1,...,qn. Denoting the core set following the i'th query by Ci, we have

Ci+1 = fka(Ci,qi) ,

where fka is the knowledge agent's core-set evolution-by-queries function. This function performs the process described below. Note that the initial core set, C0, can be empty or contain some creator-specified URLs.

The core set evolves by processing the series of training queries, two types of which are supported by the system:
  1. Text queries, which are keyword-based such as queries typically submitted to general purpose Web search engines. The agent will evolve its core set by searching for highly relevant resources to the query.
  2. Sample URL queries, which specify a few (typically 1-5) seed urls, and whose purpose is to find quality pages which are closely related to the seeds.

Thus, for each training query, the agent performs a Web search. For text-based queries, some general purpose Web search engines (of the agent's creator choice) are presented with the query, returning a root set of candidate pages. For sample URL queries, the user supplied URLs are the root candidates. The collection of root candidates is expanded into a larger set of candidate pages, S, by following the hyperlinks surrounding the root pages. The pages which currently reside in the core set are also added to S. The exact expansion model depends on the type of query which is being processed, and is out of scope of this paper. However, it is based on ideas presented in [12] and is fully explained in [2]. The resulting set of candidate pages, denoted by S, represents a directed subgraph of the WWW, whose nodes are the candidate pages and whose edges are the hyperlinks which connect the candidates.

The large set of candidate pages S are ranked by a combination of link-structure analysis and textual analysis. These analyses complement each other towards finding the pages which are both central to the agent's domain and satisfy the query well. The various components of the score are briefly described below, while the specific technical details appear in [2].

  1. The text analysis assigns each candidate page the following two separate cosine measure similarity scores, which attempt to grasp the extent to which the candidate answers the query, along with its general relevancy to the topic at hand. To ensure high precision, the basic indexing unit in these similarity scores is not a single term but a pair of correlated terms according to the so-called lexical affinity scheme introduced by one of the authors [15].
    • The similarity between the page's text and the text of the query. This is straightforward for text queries, and is adapted for sample-URL queries by regarding the aggregate text of the sample URLs as the text of a pseudo query.
    • The similarity between the page's text and the domain's terminology. The domain's terminology is taken as the aggregated text of the pages which currently occupy the core set. The effectiveness of this component of the score depends, naturally, on how representative the core set really is of the domain.
  2. Analysis of the link structure of the subgraph S assigns each candidate page with a hub score and an authority score. These scores are computed using a combination of Kleinberg-#146;s mutual reinforcement algorithm [12] and SALSA [14], a stochastic link analysis algorithm. Following [6], the links which are analyzed are weighted according to both the anchor text which is associated with them (and its similarity to the query), and according to their endpoints' membership in the evolving core set.

The textual and link analysis scores are then combined to yield the overall query relevance score for each candidate page. At this stage, the core set can be updated. One of the innovations of the KA model is that the query relevance score does not exclusively govern whether a page needs to be stored in the KAB. Rather, with each page which is already stored in the KAB (that is, with each current member of the core set) there is an associated fitness score, which reflects that page's relevance to the domain through the course of the previous iterations of the agent's training phase. The fitness scores of the current core set members are compared against the query relevance scores of the rest of the candidate pages. Pages compete for the right to be included in the KAB using an evolutionary adaptation mechanism, also described in [2]. Briefly, the first few iterations see the core set grow until it reaches its maximal size, in terms of number of pages, as defined by the agent's creator. Once the core set is full, subsequent iterations cause pages with low domain fitness scores to become stale. These are then removed from the KAB, thus vacating room for the new, fresh and fit pages. All pages which are entered into the KAB explicitly by the user receive a high initial fitness score. This conveys our high regard for the user's judgment of the quality of these pagess.

At the end of the training process, the KAB comprises a set of Web pages (and their associated fitness scores) that can be thought of as a set of category pages in some directory service. Unlike the latter though, the KA topic can be of any granularity and reflects the personal interests of its creator. These interests are not necessarily covered, or at least not in sufficient depth or specificity, by directory services. In addition, the KAB pages induce, as mentioned previously, a domain specific vocabulary, and a set of satellite pages.

While the training itself, may take a few hours (depending mainly on the network connection available) since several hundred Web paegs are crawled as part of this stage, it can be set up in a few minutes using a simple interface.

3. Information Discovery in Disconnected Mode

As mentioned above, once the KAB is resident on the device, there is no need to access the server for many further information discovery tasks as they are conducted locally on the device.

The first discovery task that is conducted on the device is topic browsing and exploration. The KAB pagess which are ranked by their fitness to the topic provide valuable information in the form of specialized bookmarks. Simply knowing the key Web pages on a given topic is a great source of information as demonstrated by directory services. Having them locally available on the device is convenient in a variety of applications. They are available for simple browsing as classical bookmarks but can also be conveniently searched as explained below. In addition to reading the top resources, the top terms in the KAB can also be explored. These terms are most likely good starting points for searches in the domain of the KAB.

The second discovery task is topic-specific information retrieval. The retrieval process typically involves the following three stages:

  1. Query formulation: Since we concentrate here on PDAs operated typically by stylus, or soft keyboard, world completion is not as crucial as on WAP phones for instance, yet it is a convenient feature as long as it is not too intrusive. Constantly changing the input string as characters are entered, as done in T9 [24] or iTAPtm [13], might be disorienting when using graffiti as users might think their graffiti was not properly recognized. Displaying optional completions, that change as input is entered seems like a reasonable solution that leaves quality control in the hand of the user. We suggest applying this approach not only at the word level but at the entire query level, by suggesting optional terms for completing and disambiguating queries as a word is completed (the end of a word being indicated by a space at the end of a string). Our key contribution in supporting query formulation assistance is 1) to use a topic-specific lexicon for both word and query completion as provided by the KAB and 2) to store the lexicon on the device for almost instantaneous response time. Once the query is formulated possibly after several refinement iterations, the search stage occurs.
  2. Search: The entire search phase is conducted exclusively on the KAB. The KAB pages are indexed on the server during the training phase using a full-text search engine. Recall that the KAB consists of core pages and satellite pages as described in Section 2. While the entire text of each core page is inserted into the index, only the anchor text which described each satellite page in the referring core page is inserted into the index. The satellite pages thus require little additional space in the KAB's index while significantly increasing the number of resources accessible from the KAB. The index is downloaded to the device as part of the KAB. Given a query, the inverted index is used to locate the KAB pagess that best answer this query. The results are ranked based on the similarity to the query as well as their fitness score. Since more qualitative pages in the KAB have higher fitness scores, the ranking of the results will reflect the similarity to the particular query as well as the page's authority in the KABs topic (this is similar in spirit to the PageRank [4] functionality of the Google search engine). Due to the locality of the inverted index, ranked search results are returned to the user in less than 1s response time on a typical Palm V device.
  3. Browse results: Result browsing is separated into high-level and low-level results. The high level results include only the title of the page in order to fit as many results as possible on the small device's screen. Note that in the case of satellite pages, the title is actually the anchor text of the referring page as the page itself is unavailable on the device. Next, more detailed information can be displayed for each result. This includes query-focused snippets for the core KAB pages, and the actual URLs for the satellite pages. Finally, the text of resident pages can be displayed in disconnected mode, while the satellite pages (as well as the original core pages) can be browsed only if using a wireless PDA and consequently may require a few seconds of waiting until they are retrieved from the Web. We believe however that this delay is acceptable because it is performed typically at the end of the search process and is not reiterated too often.

In the remainder of this section we provide more detail regarding topic-specific query formulation and the disconnected search process.

3.1 Query Formulation Assistance

Word completion

In addition to simplifying text input, word completion can also be very helpful in spelling terms correctly. In the spirit of source code editors that provide word completion based on the limited set of terms in the the programming language vocabulary, we provide word completion based on the domain-specific vocabulary encapsulated in the KAB. As keys are pressed, the agent suggests the most frequent words in the KAB vocabulary consistent with the input prefix. This list is sorted in decreasing order of frequency, increasing the chances that the first term suggested is what the user had in mind. The user can thus complete query terms with one click rather than many keystrokes. See Section 4, for examples of word completions using some sample KABs.

Query completion

Automatic query completion is usually performed by suggesting terms related to the query terms using some global semantic word network such as WordNet as suggested in [26]. In our case, on the other hand, query completion is based on the topic-specific knowledge as encapsulated in the KAB. The advantage of this approach is that the KAB's local vocabulary characterizes the domain's ontology and thus relations between terms are domain dependent. As a result, added terms disambiguate the query terms in the context of the specific domain.

The query completion process works by first clustering the terms in the query so that each cluster contains terms that most likely constitute a phrase and then identifying the best candidate expansion terms for each such phrase separately. The list of terms presented to the user is the union of each cluster's expansion terms. Since there is an upper bound on the total number of terms that can be suggested to the user (due to the small display size), each cluster is allotted the number of terms that it can contribute to this list based on the relative weight (in terms of frequency in the KAB) of the terms in the cluster. Term clustering as well as term suggestion is based on the lexical affinity relation. Recall that term t2 is considered a lexical affinity of t1 if t2 is found in proximity (e.g., within 5 words) to t1 in the text. The details of the query completion algorithm are formally expressed as follows:

Complete Algorithm

For example, consider the query "Circus dancer in Paris museums" and a KAB specializing in the french painter Toulouse-Lautrec. The terms "circus" and "dancer" are clustered together and the suggested terms in decreasing order of weights are "actress, performer, prostitute, moulin, rough, valentine, goulue". The terms suggested for "Paris" are "dorsay, musee, france, montmartre", and the term suggested for "museums" is "art". Assuming the number of suggested terms is limited to 9, and based on the computed cluster weights, the final list is "actress, performer, prostitute, moulin, rouge, dorsay, musee, art".

See Section 4 for a sample usage scenario using query completion as well as for several additional query completion examples.

3.2 Disconnected Search

As mentioned above, the KAB pages are indexed during the training phase using a text search engine. We use Palm Pirate (Palm Information Retrieval Application for Text Search) [20], a search engine for the Palm developed at the IBM research lab in Haifa. The system allows users to store, search and view text collections on their Palm device. Palm Pirate is composed of an indexing component which runs on the desktop and a search component which runs on the Palm. The KAB pages are indexed using this indexing component during the training phase. The index is then downloaded to the device as part of the KAB.

Palm Pirate uses a static inverted index, an index that does not support update operations such as insertion or deletion of pages. The reason that it is static is that the inverted index vocabulary is converted to a minimal perfect hash function which is a very efficient representation (in terms of storage) for such a vocabulary [28]. In order to build the perfect hash all keys need to be known in advance, making it a static data structure. We find that for small collections (such as KABs), static inverted indices are reasonable, as it takes little time to construct the entire index from scratch. The ratio between index size and the raw data is about 15%. The average time to construct the index is on the order of a few minutes, and the average time to process a query of 3-4 words is about 0.5 seconds regardless of the index size.

The search component of the Palm Pirate uses a standard tf-idf [23] based scoring mechanism. The word statistics stored as part of the KAB's lexicon serve as the source for term frequencies used by this algorithm. The score for a KAB page is a combination of its textual similarity score to the query and its fitness score. Recall that the fitness score of a KAB page reflects its relevance to the training queries and therefore to the KAB's domain. By using the fitness score, authoritative domain pages within the collection of search results are given priority in terms of ranking.

4. Sample Scenarios and Results

In this section, we present some scenarios that demonstrate several pervasive information discovery tasks being performed using our system. As mentioned in the introduction the premise is that that all the user has during the search and explore stage is a PDA and a modem or some other form of wireless communication. For example, he could be in the airport or on an airplane, in a museum, waiting for his child at soccer practice, or shopping.

We have created KABs on various topics ranging from technical fields such as Java, XML, Error Correcting Codes (ECC), to more recreational fields such as Toulouse-Lautrec, Broadway, Main Course Recipes (MCR), and Pokemon. We present one running scenario using the Toulouse-Lautrec KAB, as well as additional examples and statistics using some of the other KABs.

4.1 Agent Creation

Agents can be created from the desktop or the PDA. We offer both a "one-click" and an advanced interface. In addition, ready-made agents can be downloaded from an agent repository, or beamed from another PDA.

In the one-click interface, the user defines the agent by providing an optional list of seed URLs and 4-6 queries, which define the domain. In addition, the user must provide a name and description for the agent as well as their email address, which is used to identify them as the owners of the agent. In the advanced interface, users can create the agent in a more iterative fashion. They can inspect the results after each query, add new queries to agents that have already been trained, and add/remove pages from the KAB.

In Table 1 we present for four of our topics, the queries that we used to train an agent on the topic, the size of the corresponding KAB, the number of KAB pages, the precision of these pages (percent of pages relevant to the topic), the number of pages in the most closely related directory in Yahoo! [29] and the Open Directory as provided by Google [8].

Agent Training Queries KAB size # of KAB pages Open Directory Yahoo
Java
  • java programming language
  • java class libraries
  • java jdk
  • java compiler
  • java beans
  • 1.86MB 100

    precision: 95%
    1478 343
    Error Correcting Code
    (ECC)
  • "error correcting codes"
  • "linear codes"
  • "convolution codes"
  • "turbo codes"
  • "BCH codes"
  • 1.25MB 100

    precision: 84%
    closest category
    coding theory (28)
    no close category
    Toulouse - Lautrec
  • henri de toulouse-lautrec
  • paintings and posters by toulouse-lautrec
  • la goulue by toulouse-lautrec
  • toulouse-lautrec at the moulin-rouge
  • 430KB 60

    precision: 87%
    1 7
    Main Course Recipes
    (MCR)
  • chicken recipes
  • beef recipes
  • fish and seafood recipes
  • pork recipes
  • lamb recipes
  • 500KB 99

    precision: 99%
    closest category:
    recipe collections (486)
    closest category:
    (among 8)
    recipes (1057)
    Table 1: KAB statistics

    All of the agent training was done based solely on sample queries. We did not provide the agent with any seed pages since we wanted to assume the least involvement in terms of the user in the knowledge acquisition stage. Our experience has shown that the hub and authority based techniques used in training enable the agent to acquire the most prominent pages pertaining to the topic. In terms of precision, note that both the Java and Recipes agents have extremely high precision. Only five of the 100 pages, in the Java KAB are not pure Java pages (they are more general programming language pages), and only one out of 100 pages in the MCR KAB is not related to main course recipes. The precision of the ECC and Toulouse - Lautrec agents albeit being slightly lower, is still relatively high (~85%). Note that the non-relevant pages could be removed using our KAB editing functionality. However, in the spirit of assuming as little user involvement as possible, we left them in the KAB to demonstrate that even if some of the pages in the KAB are not relevant to the topic, this does not hurt the functionality of the system. In terms of the precision of the satellite pages in the KAB, an empirical study based on sampling showed that on average 75% of the satellite pages are relevant to the domain of the KAB.

    In comparison to commercial Web directories, only Java and Toulouse - Lautrec have corresponding directory entries in Yahoo! and in the Open Directory project. The Java category comprises many pages in both directories (343 and 1478 pages, respectively) and consequently would be too large to cache on a small device. The Toulouse - Lautrec directory, on the other hand, has very few pages and thus does not contain enough information to be useful as a Toulouse - Lautrec representative repository. There are no direct corresponding directories for either the ECC or MCR KABs and the most closely related categories (if any) are again not concise or highly representative of the topic.

    4.2 Topic-Specific Browsing and Exploration

    image0.gif (41734 bytes) Image1.1.gif (17365 bytes)
    Figure 2 - Selecting a KAB

    In Figure 2 we see the Toulouse - Lautrec KAB being selected from a pull-down list that contains the names of all the KABs available on the device.

    image7.gif (17961 bytes) Image6.gif (17991 bytes) Image15.gif (17187 bytes)
    Figure 3 - KAB exploration

    Once an agent is selected several operations are available which allow the user to explore the content of the KAB. Figure 3 depicts this functionality. The left-most screenshot shows the user simply viewing some basic information regarding the agent such as the name and description that the owner gave it as well as the queries used for training it. This information provides some feeling for the scope of the agent, and may be of interest when acquiring a new agent via downloading or beaming. The second screen shot shows the KAB pages sorted by their fitness score. That is, the first pages in the list are the top hubs or authorities in the domain. The third screenshot shows the top terms of the KAB. These terms are most likely good starting points for searching the KAB, and indeed clicking on any of them will initiate a search for pages that contain these terms. See Tables 3, 4, 5, 6 in the appendix for the top terms for some other KABs.

    4.3 Query formulation

    Figure 4 demonstrates the word completion functionality. After typing just two letters "am" and hitting the refine button only two words are suggested: ambassadeur, and american. For "amb" there is only one completion - ambassadeur. In contrast, the Broadway KAB would suggest the following words for the letters "am": american, amadeus, amateur, amazing, while there are no words suggested for "amb".

    As another example, consider the two letters "pi": The Pokemon KAB suggests the words: pikachu, picture, pinball, pikablu; The ECC KAB suggests: pietrobon, pipeline, pinch, pixel; The MCR KAB suggests: pie, piece, pineapple, pizza, pink. Clearly, no general purpose word completion could achieve such results.

    Image14.gif (17237 bytes) Image16.gif (17018 bytes) Image17.gif (17144 bytes)
    Figure 4: Topic-specific word completion

    Figure 5 shows the query completion functionality. After selecting the word ambassadeur from the word completion list and hitting refine again, the system suggests additional terms for the query. In this case the terms are: bruant, aristide, poster, lautrec, toulouse. The user selects bruant and aristide, to complete the query formulation stage.

    Image9.gif (16760 bytes) image10.gif (17174 bytes)
    Figure 5: Topic-specific query completion

    Some additional examples of query completion are provided in the following table:

    Agent Query term Query completion options
    Java bean java, pure, awesome
    Java java bean program, faq, tutorial
    Recipes bean white, lamb, free, middot, rice, grain
    Java block method, call, function, code, class
    ECC block turbo, code
    Pokemon: card trade, game, pokemon
    Java card game, java, applet
    Java free java, software, tool, system, operating, download
    ECC free code, distance, hamming
    Toulouse - Lautrec free poster, download
    Pokemon free pokemon, card, email, greeting, game
    Recipes red wine, dark, slightly, vinegar
    Pokemon red blue, pokemon, yellow, green, stadium
    Table 2: Query Completion Examples

    4.4 Disconnected Search

    Image4.gif (17549 bytes) Image8.gif (18154 bytes) Image5.gif (17747 bytes)
    Figure 6: Search results for "Aristide Bruant Ambassadeur"

    Figure 6 presents the results for the query "ambassadeur aristide bruant". The left-most image shows the first result screen. In this screen, only the titles of the top results are displayed (with no urls or snippets) in order to fit as many results as possible on the small screen. Once the user clicks on one of these titles, more information about this particular result is presented. More specifically, if the result is one of the core pages (for which we cache the text of the page locally), the title as well as a query focused snippet of the result with the query terms highlighted is displayed (middle image). The amount of information displayed at this level of granularity was also designed so it will more or less fill one full screen on the Palm device. On the other hand, if the result corresponds to a satellite/referenced page, we show the anchor text of the reference as the title and the url of the page itself (right image).

    Image12.gif (17073 bytes) Image13.gif (17639 bytes)
    Figure 7: Scanning selected results

    Moving along to Figure 7, the user has now chosen to access the result page by clicking on the title of the desired result. In case of core pages, the text of the page is retrieved from the cache. The text can be browsed with query terms highlighted (left image). Satellite pages need to be retrieved from the Web server where they reside by wireless communication. Note that this is the first time that network access is required. All of the steps described in the sample scenario so far where performed in disconnected mode.

    5. Conclusion and Future Work

    In this paper, we have presented a focused-search approach specifically designed for the mode of work and the constraints dictated by pervasive devices. We first identified the constraints inherent to pervasive devices as well as their implication on the search experience from such devices. We then argued that topic-specific focused search is a key to overcome these limitations. More specifically, we claimed that by coupling focused search with a knowledge representation that encapsulates domain knowledge in a concise yet representative fashion, it is possible to make information discovery from pervasive devices practical as well as efficient.

    We suggested a representation for topic-specific information based on "knowledge-agent bases" (KABs) where topics are represented and stored in a persistent repository with a very small footprint which can be downloaded to small devices. While knowledge acquisition is done on a desktop server by downloading KABs to the device, many information discovery tasks can be performed in disconnected mode. The KAB encapsulates all the information necessary to access information about a topic under the form of key concepts and key Web pages, to assist in query formulation and to perform topical searches on the device itself. Using this model, response time for many tasks is much better than in previous connected models. Several aspects of the method were demonstrated through a sample scenario and examples.

    With more mobile knowledge seekers turning to pervasive devices on a daily basis, the importance of improving the usability of information discovery tasks on such devices will become even more crucial in the future. While we have focused mainly on PDAs and in particular on PalmOS based devices in this paper, the future is clearly in mobile phones that have a much larger potential market. We believe that several aspects of our model are applicable and may be even more critical in this arena. Form factor, for example, imposes even heavier constraints and the query formulation assistance stage is even more crucial. Topic-specific knowledge encapsulation could thus result in even greater benefits. While current mobile phones do not have enough memory or processing power to use our KAB model as is, hopefully these devices will become more powerful and more importantly will open their OS so users can customize and exchange information.

    Acknowledgments

    We are in debt to Miki Herscovici and Doron Cohen for providing us with the Palm Pirate application, and their great expertise on PalmOS. We also thank Uri Weiss and Roni Raab for their contribution to the codebase and Rob Farrell for his feedback on user's experience.

    References

    1. AvantGo http://www.avantgo.com
    2. Y. Aridor, D. Carmel, R. Lempel, Y. Maarek and A.Soffer, "Knowledge Agents on the Web". In Proceedings of the 4th International Workshop, CIA 2000, Boston, MA, July 2000. LNAI 1860, pages 15--26, Springer.
    3. I. Ben-Shaul, M. Herscovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, V. Soroka and S. Ur, "Adding support for Dynamic and Focused Search with Fetuccino". In Proceedings of the 8th International World-Wide Web Conference, pages 575-588, Toronto, Canada, May 1999.
    4. S.Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine". In Proceedings of the 7th International Word Wide Web Conference, Brisbane, Australia, 1998.
    5. O. Buyukkokten , H. Garcia-Molina , and A. Paepcke, "Focused Web Searching with PDAs". In Proceedings of the 9th International Word Wide Web Conference, Amsterdam, May, 2000.
    6. S. Chakrabarti, B. Dom, D. Gibson, J. M. Kleinberg, P. Raghavan, and S. Rajagopalan. "Automatic resource compilation by analyzing hyperlink structure and associated text". In Proceedings of the 7th International World Wide Web Conference, pages 65-74, Brisbane, Australia, Apr 1998.
    7. S. Chakrabarti, M. van den Berg, and B. Dom. "Focused Crawling: a New Approach to Topic-specific Web Resource Discovery".In Proceedings of the 8th International World-Wide Web Conference, Toronto, Canada, May 1999.
    8. Google http://www.google.com
    9. M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. "The Shark-Search Algorithm - an Application: Tailored Web Site Mapping". In Proceedings of the Seventh International World Wide Web Conference. Brisbane, Australia, April 1998.
    10. Intellisync http://www.intellisync.com
    11. E. Kaasinen, M. Aaltonen, J. Kolari, S. Melakoski and T. Laakko, "Two Approaches to Bringing Internet Services to WAP Devices". In Proceedings of the 9th International Word Wide Web Conference, Amsterdam, May, 2000.
    12. J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment". In Proceedings of the ninth ACM-SIAM Symposium on Discrete Algorithms, volume 25-27, pages 668 -- 677, January 1998.
    13. iTAPtm http://www.motorola.com/MIMS/lexicus/products/iTAP.htm
    14. R. Lempel and S. Moran, "The stochastic approach for link-structure analysis (SALSA) and the TKC effect," WWW9 / Computer Networks, June, 2000, Vol. 33, No. 1-6, pages 387--401.
    15. Y. S. Maarek and F. Smadja, "Full text indexing based on lexical relations, an application: Software libraries". In Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-206, 1989.
    16. T. Masui. "An Efficient Text Input Method for Pen-based Computers". In Proceedings of the Conference on Human Factors in Computing Systems , pp. 328-335, 1998.
    17. Microsoft Internet Explorer http://www.microsoft.com/windows/ie/default.htm
    18. S. Mukherjea, "WTMS: A System for Collecting and Analyzing Topic-Specific Web Information". In Proceedings of the 9th International Word Wide Web Conference, Amsterdam, May, 2000.
    19. Palm http://www.palm.net
    20. Palm Pirate, available for download at IBM alphaworks site http://www.alphaworks.ibm.com
    21. Research in Motion http://www.rim.net
    22. J. R. Smith, R. Mohan and C.-S. Li, "Transcoding Internet Content for Heterogeneous Client Devices". In Proceedings of the IEEE International Symposium on Circuits Systems (ISCAS), Special session on Next Generation Internet, June, 1998
    23. G. Salton and M.J. McGill, "Introduction to Modern Information Retrieval", McGraw-Hill, NY, 1983.
    24. Tegic T9 Text Input http://www.t9.com
    25. B. Velez, R. Weiss, M. Sheldon, and D. K. Gifford, "Fast and effective query refinement". In Proceedings of the 20th International Conference on Research and Development in Information Retrieval (SIGIR-97), pages 6 -15, Philadelphia, PA, July 1997.
    26. E. Voorhees, "Using WordNet to Disambiguate Word Senses for Text Retrieval". In Proceedings of the 16th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, pp. 171-180, 1993.
    27. Wap Forum, http://wapforum.org
    28. I. H. Witten, A. Moffat, T.C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann 1999.
    29. http://www.yahoo.com
    30. J. Xu and W.B. Croft, "Query Expansion using Local and Global Document Analysis". In Proceedings of the 19th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pages 4-11.

    Appendix

    KAB cached sites

    KAB lexicon
  • java.sun.com - The Source for Java(TM) Technology
  • JavaWorld November 2000
  • EarthWeb.com: The IT Industry Portal
  • Java Programming Resources -- Java, Java, and more Java
  • java.sun.com - The Source for Java(TM) Technology
  • The Java Tutorial
  • java program, java faq, java language,
    java tutorial, java resource, java developer,
    java applet, java platform, java technology,
    java code, java microsystem, java application,
    program language, java class, pure java,
    java api, java sdk, java security,
    java software, microsoft java, java bean,
    java interpreter, virtual machine, java compiler,
    java book, java servlet
    Table 3: KAB on Java

    KAB cached sites

    KAB lexicon
  • The Error Correcting Codes (ECC) Home Page
  • JPL Turbo Codes Page
  • ITR's Turbo Coding Home Page
  • Quantum Error-Correcting Codes
  • Madhya Pradesh MP India
  • An Introduction to Error Correcting Codes
  • Entropy in Information and Coding Theory
  • turbo code, error code, linear code,
    error correcting code theory,
    binary code, decode code, data compression,
    information theory, code length,convolutional code,
    parity check, block code, orthogonal array,
    matrix generation, iterative decoding,
    haming code, reed code, concatenated code
    Table 4: KAB on "Error Correcting Code"

    KAB cached sites

    KAB lexicon
  • Meat: Lamb Recipes
  • http://www.simply-recipes.com/pork_recipes.htm
  • http://www.intercom.net/user/sschoen/recipes.html
  • http://www.chickenrecipe.com
  • Seafood Recipes at WYK
  • The FisherNet Cookbook - Fish Recipes
  • Stinson Seafood Company: Recipes
  • GO.com:Beef recipes
  • http://www.travelguides.com/bandb/recipes/pork.html
  • Planned Leftovers - Lamb
  • Infoseek:Beef recipes
  • http://www.catechnologies.com/cuisinesites.html
  • http://travelguides.com/bandb/recipes/pork.html
  • chicken recipes, beef recipes, middot lamb,
    mimi chicken, hiller chicken, lamb recipe,
    leg of lamb, lamb chop, seafood recipe,
    pork recipe, meat recipe, crockpot recipe,
    chicken crockpot, fish recipe, jamie chicken,
    roast lamb, low fat, middot chop
    chicken salad, pork tenderloin
    Table 5: KAB on "main course recipes"

    KAB cached sites

    KAB lexicon
  • http://cruisenews.net/pokemon.htm
  • Pokemon World - The Official Pokemon Site
  • Official Japanese Pocket Monsters(Pokemon)
    Merchandise from Japanime.com and PocketMonsters.com
  • Pokemon Top 40 at Pokemon Nation
  • Pokemon ~ Pokemon video games,
    buy Pokemon toys, books and videos
  • http://www.pokemonvillage.com/topsites
  • A1 Pokemon !! - Pokemon news,
    pokemon cheats, pokemon pictures, etc
  • PsyPoke.com - pokemon, tcg, videos, sounds, cards...
  • http://www.pokemonvillage.com
  • Pokemon Top50 - Looking for Pokemon sites?
    You've reached the best place.
  • Universal Pokemon Network
  • Pokemon cards cheats and codes pokemon chat
  • Pokemon Power Plant
  • Pokemon Store
  • pokemon cards, pokemon game,
    pokemon code, pokemon card game,
    pokemon video game, pokemon pikachu,
    pokemon cheat, pokemon center,
    pokemon trading cards, pokemon world,
    pokemon pokedex, pokemon toy,
    picture code, pocket monster,
    game boy, pokemon guide,
    pokemon gameboy, pokemon silver gold,
    pokemon character, pokemon red,
    pokemon stadium
    Table 6: KAB on "Pokemon"

    Vitae

    Yariv Aridor is a research staff member at the IBM Research Laboratory in Haifa, Israel. He received his MSc. and PhD. in Computer science from the Tel-Aviv university, Israel, in 1989 and 1995, respectively. His research interests include distributed systems, mobile object-oriented systems and agent technology.

    David Carmel is a Research Staff Member at the IBM Research Laboratory in Haifa, Israel, and belongs to the 'Information Retrieval and Organization' Group. His research interests include information retrieval, multi-agent systems and artificial intelligence. He received his MsC and PhD in Computer science from the Technion, Israel institute of technology, in 1993 and 1997 respectively. David joined IBM in 1997 and has been involved with projects dealing with text mining, search applications, and information retrieval on the Web. He is currently involved in the Knowledge Agent project.

    Ronny Lempel is a Ph.D. Student in the Department of Computer Science, Technion, Haifa, Israel, focusing on WWW link-structure analysis. He received his B.Sc. and M.Sc. from the same department in 1997 and 1999, respectively.

    Yoelle S. Maarek is a Research Staff Member at the IBM Research Lab in Haifa, Israel and manages the "Information Retrieval and Organization" group that comprises about 20 members. Her research interests include information retrieval, Internet applications, and software reuse. She graduated from the "Ecole Nationale des Ponts et Chaussees", Paris, France, as well as received her D.E.A (graduate degree) in Computer Science from Paris VI University in 1985. She received a Doctor of Science degree from the Technion, Haifa, Israel, in January 1989. Before joining IBM Research in Haifa, Dr Maarek was a research staff member at the IBM T.J. Watson Research Center for about 5 years. She serves on the program committees of several international conference and is a member of the Review Board of the WebNet Journal. She has published over 25 papers in referred journals and conferences.

    Aya Soffer is a Research Staff Member in the 'Information Retrieval and Organization' group at the IBM Research Laboratory in Haifa, Israel. Her research interests include pictorial information systems, information retrieval, and non-traditional database systems. She received her MsC and PhD degrees in Computer science from the University of Maryland at College Park in 1992 and 1995, respectively.