Adaptive Data Model for Meta Search Engines

ADMIRE: An Adaptive Data Model for Meta Search Engines

Lieming Huang, Matthias Hemmje, Erich J. Neuhold
GMD-IPSI, Dolivostr. 15, D-64293, Darmstadt, Germany
{lhuang, hemmje, Neuhold}@darmstadt.gmd.de

ABSTRACT

Considering the diversity among search engines, efficient integration of them is an important but difficult job. It is essential to provide a data model that can provide a detailed description of the query capabilities of heterogeneous search engines. By means of this model, the meta-searcher can map users' queries into specific sources more accurately, and it can achieve good precision and recall. Moreover, it will benefit the selection of target source and computing priority. Because new search engines emerge frequently and old ones are updated when their function and content change, the data model needs good adaptivity and scalability to keep in step with the rapidly developing WWW. This paper gives a formal description of the query capabilities of heterogeneous search engines and an algorithm for mapping a query from a general mediator format into the specific wrapper format of a specific search engine. Compared with related work, the special features of our work are that we focus more on the constraint of/between the terms, attribute order, and the impact of logical operator restraints. The contribution of our work is that we offer a data model that is both expressive enough to meticulously describe the query capabilities of current WWW search engines and flexible enough to integrate them efficiently.

KEYWORD: Meta search engine, data model, wrapper, mediator, query capability mapping


1. INTRODUCTION

Meta search engines and other agent-based search tools provide a uniform query interface for Internet users to search for information. Depending on users' needs, they select relevant sources and map user queries into the target search engines, subsequently merging the results. However, considering the great diversity in schematic, semantic, interface, and domain aspects, it is very important but quite difficult to make full use of the functions of specific search engines. It is essential to provide a data model that can elaborately describe the query capabilities of WWW search engines. Based on such a data model, a meta search engine can achieve several advantages: (1) It will present to users a more sophisticated query interface rather than a "Least-Common-Denominator" or mixed interface; (2) It will make the translation of queries from mediators to selected wrappers more accurate; (3) It will let users get more complete and precise results; (4) It will improve source selection and running priority decisions. However, providing a detailed data model means that a more complicated algorithm is needed to cope with the constraints of or between query terms and the restraints of logical operators when translating queries from mediators to specific wrappers.

New search engines appear frequently. Old search engines are updated when their function or content changes. In order to keep up with the rapidly developing Internet, the data model must have good adaptivity and scalability. Designing a good data model is undoubtedly a challenging job.

We have built a meta search engine for users to search for scientific publications in the Internet. This meta search engine employs a "Mediator-Wrapper" architecture that has been used by many information integration systems. The mediator [28] provides users with integrated access to multiple heterogeneous data sources, while each wrapper represents access to a specific data source. Users formulate queries in line with the mediator's global view that is combined schemas [24] of all sources. Mediators deliver user queries to some relevant wrappers. Each selected wrapper translates user queries into source specific queries, accesses the data source, and translates the results of the data source into the information that can be understood by the mediator. The mediator then merges all results and displays them to users.

ADMIRE (Adaptive Data Model for Integration of seaRch Engines) is a data model that we use to describe the query capabilities of heterogeneous search engines. This kind of information has been extracted from the query interfaces of search engines instead of the schemas of them because most of such internal information can not be achieved by meta search engines.

This paper is organized as follows. We start by briefly introducing some related work. In section 3 we first give a formal description of the query input interface of heterogeneous search engines; then we model the query capabilities of three concrete search engines and a corresponding mediator. In section 4, based on this data model, we describe some problems regarding query mapping, and an algorithm for mapping queries from a general mediator format into the specific wrapper format of a specific search engine. Finally, section 5 concludes this paper and suggests some future work.


2. RELATED WORK

In the Internet there are a lot of meta search engines like "SavvySearch" (http://www.savvysearch.com), "Dogpile"(http://www.dogpile.com), "ProFusion®"(http://www.profusion.com), "Ask Jeeves" (http://www.askjeeves.com), "Highway 61" (http://www.highway61.com), "Cyber411" (http://cyber411.com), "Internet Sleuth" (http://isleuth.com/), "MetaFind" (http://www.metafind.com), "ONESEEK" (http://www.oneseek.com), MetaCrawler (http://www.go2net.com/search.html), "I.SEE"( http://www.cs.rpi.edu/research/isee/), etc. Although they integrate a lot of WWW search engines, most of their user interfaces are too simple. They only use a "Least-Common-Denominator" user interface and discard some of the rich functionalities of specific search engines. It is difficult for users to input complicated queries and retrieve specific information. This weakness is especially obvious when users want to search for scientific publications or specialized information. "From the users' perspective the integration of different services is complete only if they are usable without losses of functionality compared to the single services and without handling difficulties when switching from one service to another" (http://www.tu-darmstadt.de/iuk/global-info/sfm-7/). In order to avoid losing important functions of search engines, both generality and particularity should be considered when designing a data model for a meta search engine. Some other meta search engines display the searching controls of all search engines on one page or on several hierarchically organized pages, e.g. "All-in-One" (http://www.allonesearch.com) displays many original query interfaces on a single page.

Some systems deal with the description of source query capability. Information Manifold [21] uses capability records that specify 5 tuples of information w.r.t each source (input set, outputs of the source, the selections the source can apply, the minimum and maximum number of inputs allowed); and it gives an algorithm for query planning. [29] gives five attribute adornments f (for free), u (for unspecifiable), b (for bound), c[S] (for constant), o[S](for optional) to describe the query capabilities of wrappers and mediators and presents algorithms to compute the set of mediator-supported queries based on the capability limitations of its sources. [3] uses "Search Support Specification" to describe the capabilities of the search engines and discusses query relaxation and transformation. [2] uses Church-Rosser systems to characterize the query capabilities of information sources. [11] gives a detailed description of rewriting predicates like "contains" and word patterns, "equals" and phrase patterns, proximity operators. [10] and [8] introduce algorithms for generating the supported subsuming queries and filters. Tukwila [17] uses adaptive query operators to produce answers quickly, and dynamic collectors to organize access to redundant and overlapping information sources. [9] applies user-defined mapping rules to translate query constraints from the mediator to specific sources and discusses the converting methods between CNF and DNF. [27] uses the p-Datalog Language to describe sources and to answer queries. ARIADNE [1] uses the LOOM knowledge representation system for modeling data. Other projects (such as Disco [26], MeDoc [4], UNICATS [12], METALICA [25], etc) also make contributions on integrating information from multiple data sources. All these methods more or less address the problems discussed in this paper in certain respects. Some results are very useful for constructing a meta search engine. But they do not consider the following problem: The translation of a query will be limited not only by the constraints of some attribute modifiers, but also by the order of terms and the restraints of logical operators. This problem needs to be considered more elaborately; otherwise, fully exploiting the functions of heterogeneous search engines will be impossible. Because they do not deal with subtle differences comprehensively, most current methods use post-processing to refine the results. On the one hand post-processing costs a lot of CPU time, on the other hand in most cases it is impossible because the meta-searcher can not acquire the relevant information. Therefore, we should try to make use of the functions of heterogeneous search engines as much as possible. Only thus can we improve the processing speed and achieve more accurate and complete results. This is exactly what meta search engines should do.

Of course, if there are protocols or standardization for query models and interfaces of all search engines and for document construction, perfect realization of meta search engines will be as easy as falling off a log. Although there are a lot of efforts towards laying down all kinds of standards (such as Z39.50 [30], Dublin Core (http://purl.oclc.org/dc/), GILS [23], STARTS [14], XML, RDF, etc), for a number of reasons (e.g. a large amount of legacy information, authors unwilling to write articles complying with strict rules, great differences from one domain to another, etc), these standards are not being applied extensively. With the emergence of XML, some systems are using it to model information sources. MIX (Mediation of Information using XML) [5] uses XML as data model for semi-structured data and exploits the structuring information provided by XML DTDs. [16] also provides an XML-based data model. Just like Alon Levy [20] points out: "XML without agreed upon DTDs does nothing to support integration at the semantic level. The names and meanings of the tags used in XML documents are arbitrary. As a result, the emergence of XML is fueling activity in various communities to agree on DTDs". In [19], Nicholas Kushmerick also says that: "Data-exchange standards such as XML will simplify the information extraction, but they are not widely used. Furthermore, they do not entirely solve the problem, since they force the data consumer to accept the producer's ontological decisions." Such problems will require meta-searchers to translate between different DTDs. We can build our meta search engine based on some attribute sets (such as the metadata elements of Dublin Core, Z39.50-1995 Bib-1 and GILS attribute set), standards of query languages (e.g. the type-101 query of the Z39.50-1995 standards, STARTS protocol) and some information formats (e.g. Harvest SOIFs).


3. QUERY CAPABILITY DESCRIPTION

There are some commonalities among search engines. At the same time, many differences exist between them. Current meta search engines use these common features to build the integrated query interface and cast away the discrepancies. This will inevitably lose many important functions. They will spend a lot of time for post-processing the results or simply display all results to users, thus increasing the users' cognitive load. Making full use of the specific functions of search engines will alleviate such weaknesses. In order to accommodate for the frequent emerging of new search engines and updating of old ones, we also demand that our data model be adaptive and scalable enough. In this section, we provide a data model to formally describe the query capabilities of search engines. First, we would like to introduce some basic definitions. Then, we use these definitions to model three concrete examples. Finally, we will present the integrated interface of our mediator.

3.1 General Definitions

In the following we show the query pages of three typical scientific search engines with quite different query capabilities:  ACM-Digital Library (See Figure 3.1.1), NCSTRL (Networked Computer Science Technical Reference Library, See Figure 3.1.2) and IDEAL® (International Digital Electronic Access Library, See Figure 3.1.3).


Figure 3.1.1 the Query Page of ACM-DL     Figure 3.1.2 the Query Page of NCSTRL
Figure 3.1.3 the Query Page of IDEAL® Search

A query expression is constructed using a number of basic elements such as terms, operators and attribute constraints. In the following, we use a "bottom-up" strategy to introduce 12 definitions, which describe the query elements successively. The first six definitions (Terms, Fields, Qualifiers, Logical Operators, Constant Tree and Constant Set) are directly derived from the query interfaces. They correspond to the basic controls of the query pages like Input box, Check box, Radio box, Pull-down menu, List box, etc. The last six definitions are constructed based on the first six basic definitions.

Definition 1


Terms

A Term is the content keyed into an input box on the query interface of a search engine. For example, users can key in a single keyword, a phrase, or a Boolean expression in order to search for relevant information. In some cases, the input term may support wildcards, truncation, stemming, it may be case-sensitive, and might drop stop-words, hyphens, diacritics and special characters. This definition is different from the usual meaning of "term". This means that even if an input box can be filled with a complex query expression, we still call it a term.



Example 3.1.1 From Figure 3.1.1, we can see that  ACM-DL has six terms in which users can input keywords. Figure 3.1.2 shows that NCSTRL provides four terms; the first term and the other three terms, respectively, belong to two separate forms. In addition, in Figure 3.1.3 we see that IDEAL® Search provides three terms.

Definition 2


Fields

A field limits the scope of a term, i.e. it requires that the provided term has to be contained in the appointed part of the result. "Fielded search" usually means that keywords provided by users should be found in certain parts of a publication.


Example 3.1.2 F={<Title>, <Full-Text>, <Review>, <Article Keywords>, <Abstract>, <Author>, <Affiliation>, <Date>, <ISBN>, <ISSN>, <Journal Title>, <Citation>, <Editor>, <Anywhere>}

In Figure 3.1.1 we can see that the first four terms can be limited by arbitrary fields (Subset of {<Title>, <Full-Text>, <Review>, <Article Keywords>, <Abstract>}); while the fifth term and the sixth term can only be modified by the <Author> field. In Figure 3.1.2, in the first querying form, the term can be limited by several fields; while in the second form, each term of NCSTRL can only be limited by a certain field. And Figure 3.1.3 shows that each term of IDEAL® Search can be limited by one of {<Title>, <Abstract>, <Author>, <Affiliation>, <Date>} or all these 5 fields. It can not be limited by subset of all terms; while  ACM-DL can.

Definition 3


Qualifiers

A qualifier is used to describe the quality and form of the term input by users.


Example 3.1.3 Q={<Exactly Like>, <Multiple Words>, <Using Stem Expansion>, <Phrase>, <Expression>, <Sound Like>, <Spelled Like>, <Natural Language>}

For  ACM-DL, users can select qualifiers from the query page. The first term has three possible qualifiers {<Multiple Words>, <Phrase>, <Expression>}, while each of the other five terms has two qualifiers. For NCSTRL and IDEAL® Search, there is no qualifier controls for users to select. However, from the help files we know that the functions of qualifiers can be embodied in the terms. For example, the phrase qualifier can be applied by using quotation marks (""); Wildcard signs like '*' and '?' can be used to express stemming expansion. We can define some rules in specific wrappers for translating query expressions. For some search engines, the qualifiers of some terms can be <Expression>. At first, it seems that query mapping is easy. Unfortunately, each term will be limited by Fields, and in the meantime, the expression in a term will be limited by the same Fields. The mapping problem for this case will be discussed in section 4. In [11], there is a deep-going discussion about the conversion of qualifiers and qualified terms. However, it does not consider the constraints imposed by fields.

Definition 4


Logical Operators

A logical Operator is used to combine logically two terms to perform a search, the results of which are then evaluated for relevance.


Example 3.1.4 L={Ù, Ú, ¬,~ } where Ù means AND; Ú means OR; ¬ means NOT; ~ means NEAR.

Each logical operator has its own "sphere of action"; it can only join two terms. Some search engines have stricter restrictions for logical operators. For example, the two logical operators of NCSTRL must have the same value and can only be "AND" or "OR". This will greatly hinder the query mapping process. We must decompose the original query expression into equivalent sub-query expressions or minimal subsuming query sub-expressions. In [10], there is a detailed discussion about the general transformation of the users' queries into a subsuming query by the conversion of DNF and CNF. It does not cope with the constraints of fields and qualifiers on the terms.

Definition 5


Constant Trees

A constant tree is a hierarchical classification of constant attributes that can not be modified by users. Users can only select one or some of them to limit the search results. The constant tree is often implemented by using a list box with multiple items.


The following is a BNF definition of constant tree:


"IncludeOrNot=+" means that the parent node of a current sub-tree has a value that is not null; while "IncludeOrNot=" means that the parent node has no value. For example, NCSTRL only contains publications from the fields of computer science and technology, so it does not contain the parent node "All categories" (the root node in this case). It can be denoted as "<All Categories>{, <Computer Science&Technology> {+, <Software>, <Hardware>}}". Building the constant tree in such a way will benefit the integration of information and query mapping. In section 3.3, we will further discuss it.

We say that node A is an ancestor of node B if node A has a higher position than node B and there is a direct path (not passing through a node higher than or on the same level with node A) from node A to B. Node B is called a descendant of node A.

Example 3.1.5 ConstTree={Category ConstTree, Publication ConstTree}; Figure 3.1.4 displays an example of a Category ConstTree.


Figure 3.1.4 Example of Category ConstTree

Definition 6


Constant Sets

A constant set is a one-dimensional set consisting of some attributes that can not be modified by users. A constant set is often implemented by using check boxes, radio boxes, or pull-down menus.



ConstSet is a special case of ConstTree.

Example 3.1.6 ConstSet={Sorting Criteria ConstSet, Grouping Size ConstSet}

Sorting Criteria ConstSet = {<Relevance ranking>,<Author>, <Date>, <Institution>, <Title>, <Journal>}

Grouping Size ConstSet = {<10>, <20>, <30>,<50>, <100>}

As for the <Relevance ranking> in Sorting Criteria ConstSet, because each search engine has its own algorithm for computing relevance, we can not rearrange all items when merging results from various search engines. [15] provides a detailed discussion of the merging of ranked result lists from heterogeneous sources. W.r.t. Grouping Site ConstSet, if a wrapper does not have the same value, we can use a nearest value and finally use buffer technology to satisfy user demands.

Definition 7


Constant Terms

A constant Term is a term that can not be modified but only selected by users. This constant term can be a Constant Tree or a Constant Set.

ConstTerm = ConstTree | ConstSet  

Definition 8


Logical Expressions

Definition 9


Sequential Boolean Expressions
 


Definition 10


Priority Operators

Definition 11


Priority Boolean Expressions

Many search engines interpret the query expression sequentially. Our mediator has the choice of interpreting the query expression by priority or sequentially. If the query expression is based on priority, we will transform the query expression into disjunctive form; each sub-expression is a conjunctive expression or a single term. We will further discuss it in section 4.

Definition 12


Query Expressions

3.2 Wrapper Modeling for Some Examples

Many institutions provide search engines for Internet users to search through publication information, for example ACM-DL, NCSTRL, IDEAL®, Kluwer, Elsevier, etc. Each search engine collects its own publications. Perhaps there are some overlapping publications, for example: ACM-DL and NCSTRL.

In the following, we provide a description of some publication search engines. We let Fields F be as in Example 3.1.2; Qualifiers Q as in Example 3.1.3; Logical Operators L as in Example  3.1.4.

Example 3.2.1 ACM-Digital Library (http://www.acm.org/dl/newsearch.html) See Figure 3.1.1

The ACM Digital Library now offers about 95% of all ACM articles and proceedings back to 1991. A query expression is built through the combination of terms (the specified words or phrases, wildcards '%'), search options (Contains phrase / Exactly Like, Stem expansion, Fuzzy expansion / Spelled Like, Sounds Like) and logical Operators (AND, OR, NOT, NEAR). A query can be limited through the selection of the various categories presented. The refinement of a query is performed to further narrow down the result set obtained from a previous query. This result set is used as the scope within which the new query will run. A query can continue to be refined until the desired result is achieved.


Example 3.2.2 NCSTRL (http://www.ncstrl.org/) See Figure 3.1.2

NCSTRL (Networked Computer Science Technical Reference Library) is an international collection of computer science research reports and papers made available for non-commercial use from a number of participating institutions and archives. NCSTRL is based on Dienst [22], which is a protocol and server that provides distributed document libraries over the WWW. Dienst is based on a document model that incorporates unique document names, multiple document formats, and multiple document decompositions.



Each of the search engines of ACM-DL and IDEAL® has only one query form. Therefore, there is only one query expression for each of them. In NCSTRL, there are two separate query forms. The first query form allows users to search keywords in all fields; while in the second one, users must limit the keywords to a specific field.

Example 3.2.3 IDEAL® search (http://search.idealibrary.com/europe), See Figure 3.1.3

IDEAL® (International Digital Electronic Access Library) is an online electronic library containing 174 Academic Press journals. IDEAL® covers 11 categories (Biomedical Science, Business and Law, Engineering, Social Sciences, etc). If users find that an article in the returned results is close to what they are searching for, clicking "More Like This" will perform a new search using the full article as the basis for the search.



From the above three examples, we know there are many common and different points between search engines. Some provide rich and complicated functions, while others have simple functions. Controls of some search engines are less limited, while controls of other search engines are limited by all kinds of constraints. Only if all these features are understood by a meta search engine can it achieve better performance.

We have used definition 12 (Query Expression) to describe the query capabilities of search engines. The implementation described in the following will demonstrate that this data model can describe an integration of the capabilities of several search engines and that it is also efficient to be used for query mapping.

3.3 Description of the Mediator's Query Capability

3.3.1 Mediator

Figure 3.3.1 displays the query interface of a meta search engine for scientific publications. This mediator integrates several famous scientific search engines such as the three which are introduced in section 3.2, Elsevier, Kluwer, etc.


Figure 3.3.1 The query page of a mediator

In Figure 3.3.1, there is a check box on the right in the center part of the query page. If it is checked, the query expression will be interpreted by priority; otherwise, the query expression will be interpreted sequentially. Usually, users' query expressions are in disjunctive form like ((A AND B) OR (C AND D)). This kind of expression can be interpreted by priority. The sequential explanation of the example is (((A AND B) OR C) AND D), which has a different meaning from the original expression. In some cases, users' information must be expressed in conjunctive form like (A AND (B OR C OR D)). In our mediator, this example can be expressed in the form (B OR C OR D AND A), which has the same meaning as the original expression if interpreted in sequential order. Because of the space limitations of a query page, users can not express queries that are more complicated. In section 3.3.3, we can see that the progressive query interface will solve this problem to some extent. The conjunctive expression can be transformed into a disjunctive expression and vice versa.

In a sense, a mediator can be regarded as a union of all wrappers. In each wrapper, there is almost no conflict between controls (terms, fields, qualifiers and logical operators). However, for a mediator, there will inevitably exist all kinds of conflicts. For example, the term limited by the <Date> field should not be modified by <Sounds Like> or <Using Stem Expansion>. In order to reduce users' cognitive load, the query interface of the mediator should consist of only some common controls. It is really a dilemma. If the mediator's interface is too general, it will not make full use of the specific functions of wrappers. Therefore, we must define some rules to evaluate the expression. There are two ways to control the conflicts. The first method is to embed these rules in the querying interface on the client side; when users select one of the controls on the page, these rules will check if there will be a conflict, if so, then alert users. The second method is to analyze users' queries at the server. If there are serious conflicts, the system will let the users redo them; otherwise, if there are only some small conflicts, the system willcorrect them.

The query expression of the mediator can be described as follows:



3.3.2 Construction of the Mediator's ConstTerm

Figure 3.3.2 depicts how to construct the corresponding ConstTree of the mediator from the ConstTrees of all wrappers. While each search engine only refers to one or more special fields, some publishing houses refer to the knowledge of almost all domains. For example, the third wrapper ConstTree in Figure 3.3.2 is an instance of a search engine that only offers publications in the field of Artificial Intelligence. Its parent node is <Computer Science> and its grandparent node is <All Categories> (i.e. the root node). It is denoted as "A{, C{, I}}". Because search engines change easily, the ConstTrees of the mediator will adapt themselves to such changes. In the implementation, we use the B-Tree data structure to represent the ConstTree.


Figure 3.3.2 Construction of the mediator's ConstTree from ConstTrees of Wrappers

3.3.3 Progressive and Customizable Query Interface

Just like most other meta search engines, at the present time we build the query interface of our mediator as a static HTML page (See Figure 3.3.1). Although such a user interface is easy to create and maintain, it lacks flexibility and the interactive nature of an information retrieval dialogue between users and the system. The functionality offered to the users is also limited. On the one hand, we do not want to provide the users with an unwieldy interface filled with all kinds of controls that will make the users' heads swim. On the other hand, we can not simply extract some common attributes from most heterogeneous search engines to provide the users with a simplified interface that will not satisfy some specific information needs. During the information retrieval dialogue, the expression of the users' information needs is a self-refining process. Therefore, it is necessary to design a flexible and progressive query interface that is capable of supporting the iterative and self-refining nature of an interactive IR dialogue.

The search engine of Elsevier Science (http://www.elsevier.nl/inca/search/) provides such a progressive HTML query page. The start query page consists of only one input box with a pull-down menu of field modifiers (also including Category ConstTree, Publication ConstTree and Grouping Size ConstSet). Besides there is a link "ADD MORE FIELDS". When users want to add additional criteria to the search, they can click this link, then the second page appears. This page has two input boxes, two field modifiers and one logical operator. If users need to add more additional criteria to the search, they can click this link many times until they have finished the query construction. In consideration of the disadvantages of the static HTML page, there are some efforts towards building graphical user interfaces for meta searchers. In the MIX system [5], the user interface BBQ (Blended Browsing and Querying, which is driven by the mediator view DTD) allows users to define a query by clicking, dragging and dropping of visual objects. "SenseMaker" [7], an interface for information exploration running on the "InfoBus" (http://www-giglib.stanford.edu/diglib/), utilizes an interactive, structure-mediated approach. It supports the context-driven evolution of a user's interests [6]. AQUA [18] provides a progressive user interface written in a Java applet.

For a progressive query interface, the starting page usually consists of some common controls just like the "Least-Common-Denominator" interface supported by most search engines. During the user query process, the query pages change according to users' needs until the query is finished. Therefore, this kind of query interface has the advantages of both a simple and a sophisticated query interface. Sometimes users only need a simple interface to input keyword(s) without any extra controls. But sometimes users want to input complicated queries and they will complain about the lack of input controls.

A conjunctive query expression based on priority is difficult to implement in an HTML page because there is no bracket mechanism to improve the priority, e.g. "A AND (B OR C)" <>"A AND B OR C". A progressive interface will also be good for the generation of such expressions. We can use some visual tools to help users to generate complicated priority-based query expressions. For example, users can implement it by manipulating the tree control or dragging and dropping of visual objects.

Because the users of a meta search engine come from all kinds of application areas, it is favourable for users to be able to personalize their user interface. For example, some people have interests only in the field of computer science. In this case the meta search engine has to provide users with functionality to customize the query interface, such as selecting some relevant search engines and category items of some general-purpose search engines. Users can also set up other parameters like sorting criteria, grouping size, quality of results (layout, file format, field selection, etc). When a user customizes the interface to a certain extent, the resulting interface will be limited to some specific search engines. Therefore, the meta search engine will provide the function of customization and the possibility for users to save often-used query templates.

Both the progression support and the customization mechanism require that mediators have to become more flexible. However, no matter how the query page changes, we can use query expression (defintition 12) to describe its query capabilities.


4. QUERY CAPABILITY MAPPING

In section 3, we use the query expression to describe the capabilities of wrappers and mediator. Based on this data model, we now introduce how the meta search engine answers users' queries. When a user submits a query, the meta search engine first checks which search engines may have relevant answers to the user's query; then it maps the user's query to wrappers for all selected sources and dispatches the transformed queries to the corresponding sources. Then it post-processes the results from each source; finally all results are merged and displayed to the user.

4.1 Source Selection

When mapping a query, if a node <A> in a mediator's ConstTree is the same as or is the descendant node of a node <B> of a wrapper's ConstTree, attribute <A> will match to this wrapper.

When a user has finished the query construction and submits it, the meta search engine's mediator will judge which search engines will be used to answer the user's query. If the number of relevant search engines is large, the running priority order will also be decided. The rough description of the algorithm of source selection is as follows:



For source selection, [13] uses a meta-index approach which is a matrix (where the number of rows is the number of terms that have been used and the number of columns is the number of search engines). The meta-index tracks the effectiveness of each search engine in responding to previous queries. The meta-index grows as new terms are encountered. This method judges if a search engine will support a keyword based on experience. But because the contents of a search engine's collection and index always change, yesterday a search engine may have found 0 hits for the term "Artificial Intelligence", and today it may add some new papers on AI, but the meta search engine still excludes this search engine on the basis of the history records. The method of using experience can quicken source selection. For example, a search engine focusing on publications on mathematics can not support some terms from chemistry fields. Therefore, we can build a thesaurus to record some often-used terms from various fields and map these terms to some search engines.


4.2 Query Mapping Algorithm

Because the conjunctive expression in sequential order can be transformed internally into a disjunctive expression fitting priority explanation, we assume the query expression of the mediator is interpreted in priority order. How can we map such a query expression to a query understood by a specific search engine? First, the query expression has to be transformed into a standard disjunctive expression where the left part is some sets of conjunctive sub-expressions, while the right part is some sets of single terms; then we map all ConstTerms to the corresponding ones of the specific wrapper. After that, we map the main part of the query expression; for each of the sub-expressions of wrapper and each of the sub-expressions of the mediator, the meta search engine will check if they are compatible, if true then it maps this sub-expression. Otherwise the meta search engine will rearrange the order of terms or decompose the sub-expression into several smaller sub-expressions (equivalent or minimal subsuming), then continue checking the compatibility and continue the mapping. Finally, the meta search engine post-processes the results. This post-processing is not merely the union of all results. It will furthermore consider the former decomposition of the query expression and users' selections that the search engine can not support.

First we suppose the query expression of the mediator (i.e. users' query expressions) is:


And the query expression of selected source wrapper is as definition 12.

The following is the sketchy algorithm for translating a query expression from the mediator to a selected source wrapper. Here we ignore the coordination of conflicts from schematic, semantic and operational aspects, such as measuring and naming conventions (e.g., different expressions of date type: 19.11.1999=11/19/1999, different scaling methods: some systems use values between [0,1] to denote the relevance, while some use values between [0, 100]). Some search engines can return all information about an entry one time; while some search engines require users to visit their server more than one time, e.g. Cora (Computer Science Research Paper Search Engine, "http://www.cora.justresearch.com/") only returns the title, authors and abstract fields and links to a Postscript file and BibTeXEntry. The meta search engine needs to revisit the Cora search engine to get the information about book title and publishing date by following the link "BibTeXEntry"; and the URL of this "BibTeXEntry" is determined by session number and the id of this entry found on the previous result. The following algorithm supposes that all these kinds of translations and transformations are realized inside each wrapper.

Step 1:



Step 2:

For each of the ConstTerms in the mediator, the meta search engine will check if the user has selected one or more items. If the user has done so and the selected item(s) can be mapped into the corresponding ConstTree of the wrapper, then fill in the query parameters of the source wrapper using user-selected value or ancestor of the user's selected value; otherwise, fill in the default value (for ConstTree, it is often the value of the root node).

Step 3:






4.3 An Example and some Problems of Query Mapping

Example 4.3.1 Suppose that a user wants to search for publications as follows:

((Author is "Tom Bush") AND ("Information Retrieval" in All fields) OR (Title contains "User Interface") AND (("Metadata", "XML") in Keywords)) in the Computer Science category, published during the period of 1997 and 1999, the results to be sorted by date.



Now we map this query expression into the wrapper of the NCSTRL search engine. By using the algorithm in section 4.2, three query expressions will be generated conforming to NCSTRL: 1. ((Author is "Tom Bush") AND ("Information Retrieval" in Title)); 2. ((Author is "Tom Bush") AND ("Information Retrieval" in Abstract)); 3. (Title contains "User Interface") AND ("Metadata", "XML") in Keywords). After three visits to the NCSTRL search engine we get the raw results. Because NCSTRL does not provide the date selection and sorting function, the meta search engine needs to post-process the results and rearrange them in date order to meet the needs of the user.

In query mapping, if the query expression is not simple, some of the following problems maybe occur:

Relationship between Fields and Terms

In Figure 3.1.2, each term of NCSTRL is limited only by a certain field. When translating a query, we will rearrange the order of terms without changing the logical consistency.

Relationship between Terms and Logical Operators

The two logic operators of NCSTRL must be same and only have the values <AND> or <OR>. If the target wrapper demands that all fields be the same, we can transform the original query expression into equivalent Conjunctive or Disjunctive query expressions, e.g., (A AND (B OR C))= ((A AND B) OR (A AND C)).

Transforming PRI-Expression to SEQ-Expression

If the length of one conjunctive sub-expression is larger than that of the corresponding expression of a wrapper, the sub-expression of the mediator needs to be pruned. In most cases, it is impossible to post-process the broken conjunctive expression, because we can not get relevant information or the post-processing will cost unbearable CPU-time. The query expression of the mediator can be transformed into a disjunctive form expression, each of its sub-expressions is either a conjunctive sub-expression or a single term. When the system finds a wrapper expression corresponding to the conjunctive sub-expression, the latter will fill in the former; if there are free slots available, feasible single terms will be used to fill these slots in order to reduce the number of generated expressions and improve efficiency.

Term in Wrapper can be Expression

Some search engines allow users to input an expression in a term (an input box). Field modifiers will limit the use of expressions because all elements of the expression belong to the same field(s). We can extract sub-expressions with the same field modifier(s) to fill such term slots.

Post-processing

In some cases, the post-processing is impossible due to inaccessible information. For example, suppose that the query expression of the mediator is (A AND B AND C AND D) and the wrapper only supports two terms, and we decompose the original expression into two sub-expressions (A AND B) and (C AND D). If the four terms are limited to the fields of "abstract" or "full-text" of the publications, then we can not intersect the two result sets from (A AND B) and (C AND D) because we can not check whether a term is in such fields. Even if we can get such information (e.g. by analyzing the ps or pdf source file), such work is unnecessary. If the four terms are in the "title" field of the publications, it is possible to check if each item from the two result set contains these four terms. If the post-processing costs a lot of time, it is better to directly display the raw results to users.

When a meta search engine meets the above-mentioned problems, it will judge whether it is necessary to cost a lot of CPU time to make results more exact. For an information searcher sometimes the response time is considered first. Users are the best filters of the results.


5 CONCLUSIONS

This paper gives a formal description of the query capability of heterogeneous search engines and an algorithm for translating queries from a mediator to a specific wrapper. From the previous sections, we know that there is great diversity among search engines. Exact and efficient query mapping is a complicated and significant task. This paper only copes with the problems of the input interface. From the output page of search engines, we can get a lot of information such as long/short description, document type (technical report, paper, thesis, etc) and formats (ps, pdf, html, etc), different languages, related information. The combination of input interface modeling and output interface modeling would better satisfy users' needs. It is also important to apply the SPJ algebra to query planning of meta search engines. We think that it is also necessary to deeply research the following problems: Finding an efficient model of the constraints and dependence of fields, qualifiers, and logical operators; coordinating the relationship between query decomposing and post-processing.


ACKNOWLEDGEMENTS

Thanks to Barbara Lutes for discussion on this work.


REFERENCES

[1]     J. Ambite, N. Ashish, G. Barish, C. Knoblock, S. Minton, P. Modi, I. Muslea, A. Philpot, and S. Tejada. Ariadne: A system for constructing mediators for internet sources. In: Proc. of the ACM SIGMOD Intl Conf. on Management of Data. Seattle, WA, June 1998, pp. 561-563.

[2]     S. Adali, and C. Bufi. A Flexible Architecture for Query Integration and Mapping. The 3rd IFCIS Conference on Cooperative Information Systems (CoopIS'1998), New York. August 20 - 22, 1998, pp. 341-353.

[3]     S. Adali, C. Bufi, and Y. Temtanapat. Integrated Search Engine. Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop, KDEX97, Newport Beach, California. November 4, 1997, pp. 140-147.

[4]     A. Barth, M. Breu, A. Endres and A. de Kemp. Digital Libraries in Computer Science: The MeDoc Approach. Springer, 1998.

[5]     C. Baru, A. Gupta, B. Ludaescher, R. Marciano Y, Papakonstantinou and P. Velikhov. XML-Based Information Mediation with MIX. In: Proc. SIGMOD 99, Philadelphia, PA, June 1999, pp. 597-599.

[6]     M. Baldonado and T. Winograd. SenseMaker: An Information-Exploration Interface Supporting the Contextual Evolution of a User's Interests. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI '97), Atlanta, Georgia, April 1997, pp. 11-18.

[7]     M. Baldonado and T. Winograd. A GUI-Based Version of the SenseMaker Interface for Information Exploration. http://cs -tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/stanford.cs/CS-TN-98-67. 1998

[8]     B. Chidlovskii, U. M. Borghoff, and P.Y. Chevalier. Boolean Query Translation for Brokerage on the Web. In Proceedings of the 2nd International Conference EuroMedia/WEBTEC98, Leicester, U.K. January 5-7, 1998, pp. 37-44.

[9]     C. Chang and H. Garcia-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. In: Proc. SIGMOD 99, Philadelphia, PA, June 1999, pp. 335-346.

[10]   C. Chang, H. Garcia-Molina and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 4, Aug. 1996.

[11]   C. Chang, H. Garcia-Molina and A. Paepcke. Predicate Rewriting for Translaing Boolean Queries in a Heterogeneous Information System. ACM Transactions on Information Systems, Vol. 17, No. 1, January 1999, pp. 1-39.

[12]   M. Christoffel, S. Pulkowski, B. Schmitt, P. Lockemann and C. Schütte. The UniCats Approach - New Management for Books in the Information Market. In Proceedings of the International Conference IuK99-Dynamic Documents. 1999.

[13]   D. Dreilinger and A. Howe. Experiences with Selecting Search Engines Using Meta-Search. ACM Transactions on Information Systems. 1996.

[14]   L. Gravano, K. Chang, H. Garcia-Molina and A. Paepcke. STARTS: Stanford Protocol Proposal for Internet Retrieval and Search. In Proceedings of the 1997 ACM SIGMOD Conference (Tucson, AZ, May). ACM Press, New York, NY, 1997, pp. 126-137.

[15]   L. Gravano and H. Garcia-Molina. Merging ranks from heterogeneous Internet sources. In Proceedings of the 23rd VLDB Conference, Athens, Greece, August 1997.

[16]   R. Goldman, J. McHugh, and J. Widom. From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. In Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), Philadelphia, Pennsylvania, June 1999.

[17]   Z. Ives , D. Florescu , M. Friedman , A. Levy, and D. Weld. An Adaptive Query Execution Engine for Data Integration. In: Proc. SIGMOD Intl Conf. Philadelphia, PA, June 1999, pp. 299-310.

[18]    L. Kovács, A. Micsik, and B. Pataki. AQUA: An advanced user interface for the Dienst digital library system. In the 8th DELOS Workshop on User Interfaces for Digital Libraries, Stockholm, Sweden, Oct. 1998.

[19]    N. Kushmerick. Regression testing for wrapper maintenance. AAAI-99 Orlando. 1999.

[20]    Levy, Alon. More on Data Management for XML. May 9th, 1999. Available at http://www.cs.washington.edu/homes/alon/widom-response.html

[21]    A. Levy, A. Rajaraman and J. Ordille. Querying Heterogeneneous Information Sources Using Source Descriptions. Proceedings of the 22nd VLDB Conference Bombay, India, 1996.

[22]    C. Lagoze, E. Shaw, J. Davis and D. Krafft. Dienst: Implementation Reference Manual. 1995.

[23]    W. Moen, E. Christian, et al. Application Profile for the Government Information Locator Service (GILS). http://www.gils.net/prof_v2.html. 1997.

[24]    B. Panchapagesan, J. Hui, G. Wiederhold, S. Erickson, L. Dean and A. Hempstead. The INEEL Data Integration Mediation System. In AIDA'99, International ICSC Symposium on Advances in Intelligent Data Analysis, Rochester, NY, June 1999.

[25]    B. Schmitt and A. Schmidt. METALICA: An Enhanced Meta Search Engine for Literature Catalogs. In Proceedings of the 2nd Asian Digital Library Conferencs (ADL'99), 1999.

[26]    A. Tomasic, L. Raschid and P. Valduriez. A Data Model and Query Processing Techniques for Scaling Access to Distributed Heterogeneous Databases in Disco. Invited paper in the IEEE Transactions on Computers, special issue on Distributed Computing Systems, 1997.

[27]    V. Vassalos and Y. Papakonstantinou. Describing and Using Query Capabilities of Heterogeneous Sources. In: Proc. 23rd VLDB Conf. Athens, Greece, pp. 256-265.

[28]    G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25(3), March, 1992, pp. 38-49.

[29]     R. Yerneni, C. Li, H. Garcia-Molina, and J. Ullman. Computing Capabilities of Mediators. In: Proc. SIGMOD, Philadelphia, PA, June 1999, pp. 443-454.

[30]     Z39.50 Maintenance Agency. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification. ftp://ftp.loc.gov/pub/z3950/official/. 1995.


VITAE

Lieming Huang, a computer scientist now is working and pursuing Ph.D. degree in GMD-IPSI. In 1994, he got Bachelor of Science degree on computer software from Department of Computer Science and Technology, Peking University, Beijing, P.R.China. In 1997, he got Master of Engineering degree on computer software from the Chinese Academy of Sciences, Beijing, P.R.China, and his Master thesis is on intelligent user interface for database systems. He is interested in the research fields of meta data, information retrieval, data management, and user interface.

 

Dr. Matthias Hemmje is division manager of the DELITE digital libraries research division at GMD-IPSI in Darmstadt, Germany. He holds a diploma and a Phd degree from Department of Computer Science of the Technical University of Darmstadt. His research interests include information retrieval, multimedia databases, agent based user interfaces, virtual environments, information visualization, viusal interaction, multimedia, and evaluation of interactive systems. From 1987-1991 he worked as a Systems Engineer, Dept. for Computer Aided Testing at Gebr. Hofmann KG, Darmstadt, from 1991-1999 he worked as a research associate at GMD-IPSI in the department for Visual Interaction Tools (VISIT) and in the research division for Open Adaptive Information Management Systems (OASYS). Matthias Hemmje is working in the context of R&D related to VRML-, MPEG-, and XML-based information retrieval and information visualization systems, as e.g., the Internet virtual gallery project (i-VG), the Congress Online (CO) congress information system which is based on a database supported Information Catalogue Environment enabling navigation on multimedia document collections. Matthias Hemmje is a member of the Visual Information Retrieval Interfaces (VIRI) working group founded by Bob Korfhage at the Univ. of Pittsburgh and he is a member of the European working group on Foundations of Advanced 3D Information Visualization (FADIVA).

Prof.Dr. Erich J. Neuhold is the director of Institute for Integrated Publication and Information Systems (IPSI) of the German National Research Center for Information Technology (GMD) and at the same time a professor of Department of Computer Science, Technical University of Darmstadt in Darmstadt, Germany. He received his M.S. in Electronics and his Ph.D. degree in Mathematics and Computer Science at the Technical University of Vienna, Austria, in 1963 and 1967, respectively. In 1986, he was appointed Director of the Institute for Integrated Publication and Information Systems of the German National Research Center for Information Technology in Darmstadt. His primary research and development interests are in heterogeneous interoperable database systems, object-oriented multimedia knowledge bases and intelligent information retrieval. He also guides research and development in user interfaces including virtual reality concepts for information visualization, computer supported cooperative work, virtual meetings and conferences as well as integrated publication and information systems with special emphasis on multimedia hyperdocuments and on information mining in the Internet/WEB environment. National and international cooperation with research and industrial partners ensures the transfer of results into widely available prototypes and products. Since 1989 he is also Professor of Computer Science, Integrated Publication and Information Systems, at the Darmstadt University of Technology, Germany. He has published 190 papers, 4 books and 9 edited books.