Searching with Numbers
Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
Copyright is held by the author/owner(s).
WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA.
ACM 1-58113-449-5/02/0005.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Theory, Algorithms, Experimentation
1. Introduction
``O dear Ophelia, I am ill at these numbers.'' (Hamlet, II, ii)

Numbers play a central role in modern life. Yet current search engines treat numbers as strings, ignoring their numeric values. For example, as of this writing, the search for 6798.32 on Google yielded two pages that correctly associate this number with the lunar nutation cycle [18]. However, the search for 6798.320 on Google found no page. The search for 6798.320 on AltaVista, AOL, HotBot, Lycos, MSN, Netscape, Overture, and Yahoo! also did not find any page about the lunar nutation cycle.
A large fraction of the useful web comprises what can be called specification documents. They largely consist of attribute-value pairs surrounded by text. Examples of such documents include product information, classified advertisements, resumes, etc. For instance, Figure 1 shows a partial extract of the data sheet for the Cypress CY7C225A PROM. A design engineer should be able to ask a query that looks something like this:

address setup speed 20 ns power 500 mW CMOS PROM

and get the CY7C225A data sheet. None of the search engines could find this datasheet using the above query (or its variations). We were able to get the data sheet when we provided the exact numeric values for the speed and power attributes, because the search engines could then match the strings. It is unreasonable to expect that the user will provide exact numeric values when doing such searches. In fact, users typically search for items whose specifications roughly match the values provided in the query.

The approach taken in the past to retrieve specification documents has been to extract the attribute-value pairs contained in a document and store them in a database. Queries can then be answered using nearest neighbor techniques [1] [15] [21]. There has been some research on automating the task of data extraction (see surveys in [7] [19]). However, the automatic extraction of attribute-value pairs has a spotty record of success. It is a hard problem, exacerbated by the fact that it is often difficult to identify attribute names and establish the correspondence between an attribute name and its value. Very often, different documents refer to the same attribute by different names. We experienced this problem firsthand in building an experimental portal for electronic components, called Pangea. For semiconductor components alone, we obtained more than 200,000 datasheets from nearly 1000 manufacturers. Our effort to extract parametric information from these datasheets met with only limited success. It is noteworthy that the major content companies in the electronics industry employ a large number of people who manually extract this parametric information.
This paper proposes a new approach to retrieving specification documents. In essence, we are suggesting that it is not necessary to establish exact correspondences between attribute names and numeric values in the data. A user query can instead choose to provide only values, without specifying the corresponding attribute names. For this approach to work, the data must have low reflectivity. This property is exhibited by many real-world datasets, and the extent to which a dataset satisfies it can be computed independent of the query distribution. For a simple example of a nonreflective dataset, assume that the documents contain only two attributes: `memory' and `disk-size'. Further assume that the range of values for memory is 64 to 512 and that for disk-size is 10 to 40. Given a query {20, 128}, the system can correctly retrieve documents that have disk-size and memory values close to 20 and 128 respectively. In this example, the attributes have non-overlapping domains. However, low reflectivity is not limited to data having such non-overlapping attributes. If memory and disk-size overlapped, but were correlated such that high-memory configurations had high disk-size, the data would still have low reflectivity.
The target repositories for our techniques are document collections on focused topics. Classification and clustering techniques can often be applied to partition a general repository into a set of topic-specific repositories. Our techniques can also be applied in other applications where providing attribute names in a query is difficult or inconvenient. For example, in a federated database system, the same attribute might be called by different names in different constituent databases [8] [16] [17].
The techniques we propose complement current search technology. One can envision a system in which the ranked results produced using our techniques are combined with the results obtained using words and links in the pages [2] [3] [9] [22]. Techniques such as [10] can be used for combining the results.
The rest of the paper is organized as follows. Section 2 provides the data and query models. We discuss reflectivity in Section 3. We present algorithms for finding matching documents without hints in Section 4, and with hints in Section 5. Section 6 gives experimental results showing the accuracy and performance characteristics of our techniques. We conclude with a summary and directions for future work in Section 7.
2. Model
We assume that a document is preprocessed to extract the numbers appearing in it. For each number, the potential attribute names are optionally extracted by examining the text surrounding the number.^{1} This extraction need not be precise. One may associate zero or multiple attribute names with a number, depending on the confidence in the quality of the extraction.
To simplify exposition, we will initially ignore units normally associated with numeric values. In Section 5.2, we discuss the handling of units within our framework.
At the time of querying, the user may simply provide numbers, or may optionally specify attribute names with the numbers. The attribute names provided in a query may not correspond to those present in the data, since the same attribute may be called by multiple names, e.g., salary, income, pay, etc. We consider nearest neighbor queries where the user is interested in retrieving the top t documents containing values close to the query terms.
2.1 Database and Queries
Let $\mathcal{N}$ be the universe of numbers and $\mathcal{A}$ the universe of attribute names. The database consists of a set of documents. A document D is defined to be

$$D = \{\langle n_1, H_1 \rangle, \ldots, \langle n_m, H_m \rangle\}, \quad n_i \in \mathcal{N},\ H_i \subseteq \mathcal{A} \qquad (1)$$

where H_{i} is the set of attribute names (hints) associated with the number n_{i} and m is the number of unique ⟨number, attribute names⟩ pairs present in D. Note that it is possible to have n_{i} = n_{j} but H_{i} ≠ H_{j}, since more than one attribute name can have the same numeric value. An attribute may have a set of values (e.g., different processor speeds in the same datasheet) and hence it is also possible to have H_{i} = H_{j} but n_{i} ≠ n_{j}. Duplicates (n_{i} = n_{j} and H_{i} = H_{j}) may optionally be admitted by treating D as a multiset.
If the document does not have hints associated with the numbers, D is simply a multiset:

$$D = \{n_1, n_2, \ldots, n_m\} \qquad (2)$$

A search query Q consists of

$$Q = \{\langle q_1, A_1 \rangle, \ldots, \langle q_k, A_k \rangle\}, \quad q_i \in \mathcal{N},\ A_i \subseteq \mathcal{A} \qquad (3)$$

Here each pair ⟨q_{i}, A_{i}⟩ represents a query term and k is the number of query terms present in Q. A_{i} is the set of attribute names associated with the number q_{i}. We allow the user to associate a set of attribute names with a number since there may be synonyms. There is an implied conjunction between the query terms. It is possible to have q_{i} = q_{j} but A_{i} ≠ A_{j}, i.e., a query may contain more than one occurrence of the same number.

If the query does not have hints associated with the numbers, Q is simply a multiset:

$$Q = \{q_1, q_2, \ldots, q_k\} \qquad (4)$$
2.2 Matching Without Attribute Names
The goal of any search is to return the documents most similar to the query, ordered by their similarity score. The real problem lies in defining similarity. The generally accepted approach is to use some L_{p} norm as the measure of distance and take similarity to be the inverse of this distance.
Consider a query system where attribute names are not available, either in the data or in the query. For a query Q = {q_1, ..., q_k}, there must be a true attribute that the user has in mind corresponding to each number q_{i} in Q. Similarly, for a document D = {n_1, ..., n_m}, each number n_{i} in D has a corresponding true attribute.
We call a document D a true close match to a query Q if D contains a set of k numbers D' = {n_{j_1}, ..., n_{j_k}} such that the distance between Q and D' is small (i.e., less than some constant r) and each q_{i} and n_{j_i} are values of the same true attribute. We have treated both Q and D' as ordered sets in this definition.
We call a document D a nameless close match to a query Q if D contains a set of k numbers D' = {n_{j_1}, ..., n_{j_k}} such that the distance between Q and D' is small. Unlike the case of a true close match, the true attributes corresponding to n_{j_i} and q_{i} need not be the same. Notice that by requiring a set of k numbers in D, we impose the desired constraint that a number instance in the document should not match multiple query terms.
Underlying our approach is what we call the realistic query conjecture: users do not ask queries by combining attribute values at random, but rather tend to combine them in ``realistic'' ways that reflect documents actually present in the data.
Informally, if we get a good match on the numbers between a document and query, then it is likely that we will have correctly matched the attributes as well. Thus for datasets where this conjecture holds, we can match using only the numbers, and still get good accuracy compared to the benchmark where we know the true attributes in both the documents and the query.
In Section 3, we define a property called reflectivity that allows us to quantify the extent to which this conjecture holds in a given dataset. In fact, if the query distribution for some set of true attributes is the same as the data distribution projected onto that subspace, the likelihood in the above conjecture is the same as the value of nonreflectivity.
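To make the two kinds of match concrete, here is a small brute-force sketch. It is only an illustration: it assumes the relative L1 distance that Section 4 later adopts, and it represents true attributes as lists parallel to the numbers; all names are ours.

from itertools import permutations

def dist(Q, nums):
    # Relative L1 distance between query numbers and candidate numbers,
    # as in Section 4: w(q, n) = |q - n| / |q|.
    return sum(abs(q - n) / abs(q) for q, n in zip(Q, nums))

def nameless_close_match(Q, D, r):
    # D is a nameless close match if SOME assignment of k distinct number
    # instances in D (one per query term) is within distance r of Q.
    return any(dist(Q, cand) <= r for cand in permutations(D, len(Q)))

def true_close_match(Q, Q_attrs, D, D_attrs, r):
    # As above, but each matched number must also belong to the same true
    # attribute as the query term it is matched with.
    k = len(Q)
    for idx in permutations(range(len(D)), k):
        if all(Q_attrs[i] == D_attrs[j] for i, j in enumerate(idx)):
            if dist(Q, [D[j] for j in idx]) <= r:
                return True
    return False

# Every true close match is also a nameless close match, but not vice versa.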
3. Reflectivity
Consider the data shown in Figure 2(a) for two attributes a_{1} and a_{2}. The diamonds indicate the points actually present in the data. This data is completely nonreflective: for any point present in the data, its reflection does not exist. Figure 2(b) gives an example of clustered data that is almost completely nonreflective. The correlated data in Figure 2(c) is also largely nonreflective. However, the data in Figure 2(d) is highly reflective.
Figure 2: Examples of reflectivity. (a) Nonreflective; (b) low reflectivity (clustered); (c) low reflectivity (correlated); (d) high reflectivity.
The implication of reflectivity is that queries against low-reflectivity data can be answered accurately without knowing the attribute names, provided the realistic query conjecture holds. Hence, although there is complete overlap between the ranges of values of the two attributes in Figures 2(a)-(c), we will get high accuracy on any 2-dimensional query. For example, consider a query {x, y} on the data in Figure 2(a) whose reflection falls in an empty region: the query can only map to ⟨a_1 = x, a_2 = y⟩, since there are no points in the region around ⟨a_1 = y, a_2 = x⟩.
Queries will often span a subset of the dimensions in the dataset, and reflectivity will depend on the exact set of dimensions being considered. Consider the 1-dimensional query {20} on the data in Figure 2(a). This query can map to either attribute a_1 or attribute a_2, and the answer to this query will consist of points for which either a_1 or a_2 is close to 20. Thus precision on this query will be around 50%, in contrast to the close to 100% precision that we can get on a 2-dimensional query over the same dataset. Similar behavior is exhibited by the data in Figures 2(b)-(c): these datasets are highly nonreflective in 2 dimensions, but quite reflective in either of the 1-dimensional projections.
Before formally defining reflectivity, we make the following observations.
 Above, we took a given query Q and checked whether or not the reflections of Q coincided with other data points. However, for similarity queries, we care not only about the exact query values, but also about points close to the query values. Hence we should look at the number of points within distance r of each reflection of Q.
 Rather than taking a query and considering whether there are points close to the reflections of the query, we take a dual viewpoint. For a given query Q, we consider how many reflections of other points are within distance r of Q and compare this number with the number of points that are truly within distance r of Q.
3.1 Definition
Let X = {x_1, ..., x_N} be a set of m-dimensional points of cardinality N. Let (x_i^1, ..., x_i^m) denote the coordinates of point x_{i}. We first define reflectivity over the full m-dimensional space, and then give a more general definition for subspaces.
We define the reflections of the point x_{i} to be the set of coordinate tuples obtained by permuting (x_i^1, ..., x_i^m) (including the identity permutation). For example, if x_{i} were (a, b), the reflections of x_{i} would be {(a, b), (b, a)}.
Let close(x_i, r) denote the number of points within distance r of x_{i} (in m-dimensional space). The value of r is so chosen that the average value of close(x_i, r) (over all x_{i}) is close to the number of top answers that users will be interested in. Let rclose(x_i, r) denote the number of points in X that have at least one reflection within distance r of x_{i}. The reflectivity of X in m-dimensional space is then defined to be:

$$\mathrm{Reflectivity}(m, r) = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{close(x_i, r)}{rclose(x_i, r)} \qquad (5)$$

Now consider a k-dimensional subspace S of the space of X. We define the k-reflections of a point x_{i} in X to be the set of coordinate tuples obtained by considering the k! permutations of each of the ^{m}C_{k} combinations of k coordinates chosen from (x_i^1, ..., x_i^m). For example, the 2-reflections of a 3-dimensional point (a, b, c) form the set {(a, b), (b, a), (a, c), (c, a), (b, c), (c, b)}.
Let x_i^S represent the coordinates of point x_{i} projected onto this subspace. Let close(S, x_i^S, r) denote the number of points in X whose projections onto the subspace S are within distance r of the coordinates x_i^S (in the k-dimensional space). As before, the value of r is so chosen that the average value of close(S, x_i^S, r) (over all x_{i}) is close to the number of desired top answers. Let rclose(S, x_i^S, r) denote the number of points in X that have at least one k-reflection within distance r of the coordinates x_i^S (in the k-dimensional space). The reflectivity of the subspace S is defined to be:

$$\mathrm{Reflectivity}(S, r) = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{close(S, x_i^S, r)}{rclose(S, x_i^S, r)} \qquad (6)$$
Finally, let $\mathcal{S}_k$ represent the set of k-dimensional subspaces of X, and let $|\mathcal{S}_k|$ denote the number of such subspaces. Then the reflectivity of X over k-dimensional subspaces is defined to be the average of the reflectivity in each subspace:

$$\mathrm{Reflectivity}(k, r) = \frac{1}{|\mathcal{S}_k|} \sum_{S \in \mathcal{S}_k} \mathrm{Reflectivity}(S, r) \qquad (7)$$
Note that nonreflectivity is simply the complement of reflectivity: Nonreflectivity(S, r) = 1 - Reflectivity(S, r), and similarly for Eqs. 5 and 7.
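The definitions translate directly into a brute-force computation. The sketch below assumes the points are held in a NumPy array and uses L1 distance; enumerating all k-reflections is feasible only for small m and k, beyond which the sampling ideas of Section 3.3 apply. The function and variable names are illustrative.

import itertools
import numpy as np

def reflectivity(X, S, r):
    # Reflectivity(S, r) of Eq. 6 for a subspace S (a tuple of k attribute
    # indices) of the points X, an (N, m) array with known true attributes.
    N, m = X.shape
    k = len(S)
    proj = X[:, list(S)]                      # x_i^S for every point

    # All k-reflections: every ordered choice of k of the m coordinates
    # (the k! permutations of each of the C(m, k) combinations).
    perms = list(itertools.permutations(range(m), k))
    refl = np.stack([X[:, list(p)] for p in perms], axis=1)  # (N, |perms|, k)

    total = 0.0
    for i in range(N):
        q = proj[i]
        # close(S, x_i^S, r): projections within L1 distance r of x_i^S
        close = np.sum(np.abs(proj - q).sum(axis=1) <= r)
        # rclose(S, x_i^S, r): points with some k-reflection within r
        rclose = np.sum(np.abs(refl - q).sum(axis=2).min(axis=1) <= r)
        total += close / rclose               # rclose >= close >= 1 always
    return 1.0 - total / N

# Nonreflectivity, the estimate of expected precision, is 1 minus this value;
# Eq. 7 averages the result over all k-dimensional subspaces S.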
3.2 Implicit Name Match Conjecture Revisited
Let S be the subspace corresponding to the true attributes in a query Q, and let q^S denote the coordinates of Q in S. Then there are close(S, q^S, r) documents that are true close matches to Q, and rclose(S, q^S, r) documents that are nameless close matches to Q. Hence for this query, the probability that a document that is a nameless close match will also be a true close match is simply close(S, q^S, r) / rclose(S, q^S, r).
Let us represent a query distribution for a subspace S by a random sample of queries {Q_1, ..., Q_n} drawn from that distribution. Then for a query belonging to this distribution, the probability that a document that is a nameless close match will also be a true close match is

$$\frac{1}{n} \sum_{i=1}^{n} \frac{close(S, Q_i, r)}{rclose(S, Q_i, r)} \qquad (8)$$

Finally, if the query distribution for a subspace S is close to the distribution of documents projected onto S, we can treat the set of points {x_i^S} as a sample of the query distribution, and the probability that a document that is a nameless close match will also be a true close match is simply Nonreflectivity(S, r). Thus reflectivity can serve as a close proxy for expected accuracy.
3.3 Computing Reflectivity
If the database knows the correct attribute names corresponding to data values, we can use Eq. 7 to compute reflectivity. Null values in the data can be handled as follows. Suppose the values of some of the coordinates of an m-dimensional point x_{i} are null (unknown). We ignore this point in the computation of reflectivity in Eq. 5. When computing the reflectivity of a subspace S in Eq. 6, the term for point x_{i} is excluded from the summation if a null value is present in the set of coordinates projected onto S. However, x_{i} may still contribute to the denominator of the above term for some other point x_{j} if a k-reflection of x_{i} does not have any null coordinate values and this reflection is within distance r of x_j^S.
The cost of computing reflectivity can be reduced by doing the summation in Eq. 6 for a sample of data points. Similarly, we can do summation over a sample of subspaces in Eq. 7.
Consider now the scenario where attribute names have not been assigned to most of the values in the documents. If we can get hints for the attribute names, we can treat the highest-ranked hint for each number as the attribute name and compute reflectivity. We empirically evaluate this idea in Section 6.6. Our results show that this idea tends to be useful if the accuracy of hints is relatively high.
Finally, consider the situation in which we do not even have good hints. The proposed techniques may still work well; we just will not be able to estimate accuracy a priori. If the answers displayed to the user show sufficient summary information that the user's selections are a reliable indicator of the accuracy of the answer, we can use the precision of the answers as a rough estimate of reflectivity.
3.4 Remarks
 Non-overlapping Attributes: If the attributes of a dataset do not overlap, the data is necessarily nonreflective for queries of any length.
 Clustering and Correlation: For a fixed amount of overlap between attributes, clustered and/or correlated datasets are likely to have lower reflectivity than datasets where the attributes are independent. Figures 2(b) and (c) support this intuition. In Section 6.5, we quantify this effect for nine real-world datasets by computing reflectivity both for the original dataset and for a modified dataset where we destroy any correlation or clustering while keeping the actual values of each attribute fixed. Destroying correlation and clustering increases reflectivity in all nine datasets, often dramatically. Of course, it is easy to come up with counterexamples of correlated or clustered datasets that are quite reflective, as shown in Figure 3.
Figure 3: Counterexamples. (a) Clustered and reflective; (b) correlated and reflective.
4. Algorithms
We now give algorithms for finding documents in response to a user query. These algorithms assume that the database as well as queries have no attribute names associated with the numbers. Section 5 discusses how to take advantage of this extra information.
Recall that a document D consists of a multiset of numbers {n_1, ..., n_m} (Eq. 2), and a search query Q consists of a multiset of numbers {q_1, ..., q_k} (Eq. 4). Each value corresponds to an unspecified attribute name.
In computing the distance of query Q from a document D, each q value is matched with exactly one n value. Given a set of query numbers {q_1, ..., q_k} and a set of matching document numbers {n_{j_1}, ..., n_{j_k}}, the distance function F with the L_{p} norm (p = 1, 2, ..., ∞) is defined as

$$F(Q, D) = \left( \sum_{i=1}^{k} w(q_i, n_{j_i})^p \right)^{1/p} \qquad (9)$$

where w(q_{i}, n_{j_i}) is the distance between q_{i} and n_{j_i}. We expect that F will typically use relative distance, since otherwise some query terms will get disproportionate importance and other query terms will be ignored. For example, w(q_{i}, n_{j}) may be defined as |q_i - n_j| / |q_i|.
Maximizing similarity is equivalent to minimizing distance.
4.1 Matching a Document to a Query
Given a set of k query numbers Q and a set of m document numbers D, we want to select the k numbers in D that will lead to the minimum distance. Each number in D is allowed to match with a single number in Q, and vice versa.
Construct a weighted bipartite graph G as follows:
 Create k source vertices labeled q_1, ..., q_k, corresponding to the k numbers in Q.
 Create m target vertices labeled n_1, ..., n_m, corresponding to the m numbers in D. If m < k, add k - m additional target vertices with value ∞.
 From each source vertex q_{i}, create an edge to the k closest target vertices in {n_1, ..., n_m}.^{2} Assign weight w(q_i, n_j)^p to the edge (q_i, n_j).
Figure 4 shows the weighted bipartite graph for Q = {20, 60} and D = {10, 25, 75}, assuming the distance function to be L_1 and w(q_i, n_j) = |q_i - n_j| / |q_i|.
Lemma
The optimum solution to the minimum weight bipartite graph matching problem for
the graph G matches each number in Q with a
distinct number in D such that we get the lowest value for the
distance score F(Q,D).
We have marked in bold the edges comprising the optimum solution for the graph in Figure 4. Thus, 20 in Q is matched with 25 in D and 60 with 75 for a total distance score of 0.5.
We can now refer to the rich weighted bipartite graph matching literature (see survey in [4]) to find the best matching between the numbers in a query and the numbers in a document. We also obtain the distance score at the same time, which is used for ranking the documents. By repeating this process for every document in the database, we have a solution to our problem. In Section 4.2, we present techniques that avoid examining every document.
The best known algorithm for the weighted bipartite matching problem is due to Feder and Motwani [13]; its running time is a low-order polynomial in the number of edges e in the graph. Since each of the k source vertices has edges to at most k targets, e ≤ k^2, and the cost of matching a document therefore depends only on the query size k.
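As a concrete illustration, the sketch below solves the same minimum-weight matching with the Hungarian-method solver from SciPy rather than the Feder-Motwani algorithm cited above. For simplicity it builds the full weight matrix instead of only the k closest edges per query term, which does not change the optimum; the padding with large-cost targets stands in for the ∞-valued vertices of the construction.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_score(Q, D, p=1):
    # F(Q, D) of Eq. 9: match each query number to a distinct document
    # number so that the total distance is minimized.
    k, m = len(Q), len(D)
    BIG = 1e9                     # stands in for the infinity-valued targets
    W = np.full((k, max(k, m)), BIG)
    for i, q in enumerate(Q):
        for j, n in enumerate(D):
            W[i, j] = (abs(q - n) / abs(q)) ** p    # edge weight w(q, n)^p
    rows, cols = linear_sum_assignment(W)           # optimal matching
    return W[rows, cols].sum() ** (1.0 / p)

# The example of Figure 4: Q = {20, 60}, D = {10, 25, 75} under L_1.
# 20 is matched with 25 and 60 with 75, for a distance score of 0.5.
print(match_score([20, 60], [10, 25, 75]))          # 0.5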
4.2 Limiting the Set of Documents that are Matched
We now address the question of how to limit the number of documents for which we have to compute the distance. This problem turns out to be similar to that of retrieving the top t objects that have highest combined score on k attributes, introduced in [11]. We first describe the score aggregation problem and the threshold algorithm for solving this problem [12] [14] [20].
Score Aggregation Problem
Assume that each object in a database has k scores, one for
each of k attributes.
For each attribute, there is a sorted list, which lists each object
and its score for that attribute, sorted by score (highest score first).
There is some monotone aggregation function f for combining
the individual scores to obtain an overall score for an object.
The problem is to efficiently find the top t objects that have the
best overall score.
Threshold Algorithm (TA)
There are two modes of access to data.
Sorted access obtains the score of an object in one of the sorted
lists by proceeding through the list sequentially from the top.
Random access obtains the score of an object in a list in one access.
The threshold algorithm works as follows
[12].
 1.
 Do sorted access in parallel to each of the k sorted lists L_{i}. In other words, access the top member of each of the lists under sorted access, then the second member, and so on. As an object D is seen in some list, do random access to other lists to find score s_{i} of object D in every list L_{i}. Then compute the overall score of object D. If this score is one of the t highest we have seen, then remember D and its score f(D).
 2.
 For each list L_{i}, let s_{i} be the score of the last object seen under sorted access. Define the threshold value τ to be f(s_1, ..., s_k). Halt when t objects have been seen whose overall score is at least equal to τ.
 3.
 Let Y be a set containing the t objects that have been seen with the highest scores. The result is the graded set {⟨D, f(D)⟩ : D ∈ Y}.
Proposed Adaptation
We now discuss how our problem is similar to the score aggregation problem.
We then show how the threshold algorithm can be adapted to our problem.
Assume that the documents have been processed to create data structures
to support the following types of accesses.
 Database Access: Given a document id, return the multiset of numbers present in the document.
 Index Access: Given a number, return the set of documents in which this number is present. Only numbers that appear in at least one document are included in this index. Numbers are kept sorted so that it is easy to determine the nearest left neighbor (smaller number) and nearest right neighbor (larger number) of a number. We can use a B-tree [6] for this purpose if the index is too large to fit in memory.
Here is the algorithm, stated in the TA framework. While reading the algorithm, keep in mind that a document with a lower distance score is closer to the query, and hence better in our setting.
 1.
 Form k conceptual lists, one for each query term q_{i}, as follows. For every q_{i}, create an ordered list of numbers n_i^1, n_i^2, n_i^3, ... such that w(q_i, n_i^1) ≤ w(q_i, n_i^2) ≤ w(q_i, n_i^3) ≤ .... (Recall that w(q_i, n_i^j) is the distance between q_{i} and n_i^j.) Associate the score s_i^j = w(q_i, n_i^j) with every document returned by index access on n_i^j. The list L_{i} for q_{i} is now defined to consist of the documents obtained from index lookups on the terms n_i^1, n_i^2, n_i^3, ..., sorted in ascending value of score (lowest score first). Note that these lists are not physically materialized, but the next() operation on these lists is well-defined and can be efficiently implemented using the index access described above.
 2.
 Do a round-robin access to each of the k sorted lists L_{i}.
As a document D is seen in some list, do a database access for this
document and match it with the query using
the algorithm from Section 4.1.
The distance score returned by the matching algorithm
gives the overall score of the document.
 3.
 Let n_i' be the number in the index that we last looked at for query term q_{i}. Define the threshold value τ to be the distance F({q_1, ..., q_k}, {n_1', ..., n_k'}) from Eq. 9. Halt when t documents have been seen whose distance from Q is less than or equal to τ.
At this point, for any document that has not been seen in the index, the closest number to each query term q_{i} must be at least as far from q_{i} as n_i', and hence the distance between the document and the query must be at least as high as τ.
Note that unlike the original threshold algorithm, the s_i^j scores in the adaptation are lower bounds on the distance, not necessarily the actual distance. In other words, when we match a document D to a query, the number that ends up being matched with q_{i} may be further away from q_{i} than indicated by the score for D in the sorted list for q_{i}. The reason is that during matching, a number in D can only match one query term, but we do not track this constraint during index access (to avoid the bookkeeping overhead). Thus if a single number in D is the closest number to two different query terms, one of the two scores for D will be a lower bound on the actual distance. This does not affect the correctness of the halting criterion in step 3, since the threshold value τ is a lower bound on the distance of any document that we have not yet seen, and we stop when we have seen t documents whose distance is no more than τ.
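To make the adaptation concrete, here is a minimal in-memory sketch; it reuses match_score from the sketch in Section 4.1, stores the index as a sorted array rather than a B-tree, and assumes positive numbers. All names are illustrative.

import bisect

def w(q, n):
    # Relative distance, as in Section 4; assumes positive numbers.
    return abs(q - n) / abs(q)

def top_t(Q, docs, t, p=1):
    # docs: dict doc_id -> list of numbers; returns [(score, doc_id), ...].
    postings = {}                             # index access: number -> ids
    for d, nums in docs.items():
        for n in nums:
            postings.setdefault(n, set()).add(d)
    numbers = sorted(postings)

    def conceptual_list(q):
        # Yield index numbers in ascending order of w(q, .), walking
        # outward from q's position in the sorted index.
        hi = bisect.bisect_left(numbers, q)
        lo = hi - 1
        while lo >= 0 or hi < len(numbers):
            if hi >= len(numbers) or (lo >= 0 and
                                      w(q, numbers[lo]) <= w(q, numbers[hi])):
                yield numbers[lo]; lo -= 1
            else:
                yield numbers[hi]; hi += 1

    lists = [conceptual_list(q) for q in Q]
    frontier = [next(it) for it in lists]     # n_i' for each query term
    seen, results = set(), []
    while True:
        for i, it in enumerate(lists):        # round-robin sorted access
            for d in postings[frontier[i]]:
                if d not in seen:             # database access + matching
                    seen.add(d)
                    results.append((match_score(Q, docs[d], p), d))
            frontier[i] = next(it, frontier[i])
        results.sort()
        # Threshold: distance from Q to the frontier numbers (Eq. 9).
        tau = sum(w(q, n) ** p for q, n in zip(Q, frontier)) ** (1.0 / p)
        if (len(results) >= t and results[t - 1][0] <= tau) \
                or len(seen) == len(docs):
            return results[:t]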
5. Using Attribute Names and Units
5.1 Attribute Names
We now describe how to use hints about attribute names to aid matching.
Let H_{i} denote the set of attribute names associated with the number n_{i} in a document. As before, let A_{i} denote the set of attribute names associated with q_{i} in a query. We extend the distance function from Eq. 9 to incorporate hints as follows:

$$F(Q, D) = \left( \sum_{i=1}^{k} \bigl( w(q_i, n_{j_i}) + B \cdot v(A_i, H_{j_i}) \bigr)^p \right)^{1/p} \qquad (10)$$

The parameter B balances the importance of the match on the numbers against the match on the hints. In general, the higher the accuracy of the hints and the higher the reflectivity of the data, the higher should be the value of B.
Recall that the function
w(q_{i},n_{j}) determines the distance
between a query number and a document number.
Analogously,
v(A_{i}, H_{j}) is a function that determines the
distance between the set of attribute names associated with a query
number and the set of attribute names associated with a document number.
We use the following distance function v in our experiments:

$$v(A_i, H_j) = \begin{cases} 0 & \text{if } A_i = \emptyset \text{ or } A_i \cap H_j \neq \emptyset \\ 1 & \text{otherwise} \end{cases} \qquad (11)$$

This function penalizes a match only if q_{i} and n_{j} are likely to belong to different attributes. Hence if any of the attribute names in the query matches any of the hints, or if there is no attribute name specified in the query, the distance is zero. Otherwise, the distance is 1.
Our techniques are independent of the specific form of the function v. A more sophisticated function form tuned to the specific data extractor being used (e.g., by incorporating belief values generated by the data extractor) may yield better results. However, as we will see, our experiments indicate that even this simple function can be quite effective.
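In code, the hint distance and the resulting edge weight are one-liners; here A and H are sets of names, B is the balance parameter of Eq. 10, and the example names are hypothetical.

def v(A, H):
    # Eq. 11: no penalty if the query gives no names or any name matches.
    return 0.0 if not A or (A & H) else 1.0

def edge_weight(q, n, A, H, B, p=1):
    # Weight of edge (q, n) in the bipartite graph of Section 5.1.
    return (abs(q - n) / abs(q) + B * v(A, H)) ** p

# Example (hypothetical names): a query term <3.3, {"vcc", "supply"}>
# against a document number 3.3 hinted as {"vih"} incurs the penalty B.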
Determining the weight B
A good value for B is important for successfully using hints. Suppose the website provides enough summary with each answer that the user is likely to click on relevant answers. By tracking user clicks, we can get a set of queries for which we know the true answers with high probability. Treat this set as a tune set and evaluate the performance of the matching algorithm for different values of B; then pick the value that gives the highest accuracy. If a set of values gives similar accuracy, pick the median value in the set. Note that the best value of B is likely to be different for different query sizes, and hence we will need a tune set per query size.
Constructing the Bipartite Graph
As before, we map the matching problem to weighted bipartite graph matching. The only difference is that we now assign (w(q_i, n_j) + B · v(A_i, H_j))^p as the weight of the edge (q_i, n_j).
Limiting the Set of Matches
The algorithm proposed in Section 4.2 can be used as is
even in the presence of hints, since we only need the s_{i}^{j} score to
be a lower bound on the true distance. That is, we can use
s_{i}^{j} =
w(q_{i}, n_{i}^{j}) to create the sorted list L_{i} for q_{i},
ignoring the match on attribute names.
However, the algorithm will be considerably more efficient if we modify it as follows. First, create an index to support an additional type of access, hint index access: given a number n_{i} together with an attribute name a_{i}, return the set of documents in which n_{i} is present and the set of hints for n_{i} includes a_{i}. The original index can be (conceptually) treated as a special case of this index, where the attribute name is the empty name ε.
Now, in Step 1, for each query term q_{i}, create an ordered list of ⟨number, attribute name⟩ entries ⟨n_i^j, a_i^j⟩ and associate the score s_i^j = w(q_i, n_i^j) + B · v(A_i, {a_i^j}) with each entry, such that the list is in ascending order of score. We can do this efficiently by using hint index access for each attribute name in the set A_{i} associated with the query term q_{i}, and also for the empty attribute name ε. Step 2 does not change. In Step 3, the only change is that we now define the threshold value τ to be the distance from Eq. 10, where ⟨n_i', a_i'⟩ is the entry in the index that we last looked at for query term q_{i}.
5.2 Units
If the unit names are available in addition to attribute names, we extend the distance function from Eq. 10 by adding a term for whether the units match:

$$F(Q, D) = \left( \sum_{i=1}^{k} \bigl( w(q_i, n_{j_i}) + B \cdot v(A_i, H_{j_i}) + B^u \cdot u(A^u_i, H^u_{j_i}) \bigr)^p \right)^{1/p} \qquad (12)$$

A^u_i denotes the set of unit names for q_{i}, H^u_j is the set of unit names for n_{j}, and u(A^u_i, H^u_j) is the distance function between the two sets of unit names. The function u is defined in a manner similar to v(A_i, H_j). The parameter B^u is analogous to B and can be similarly computed.
An additional complication arises due to different units (e.g., MHz and GHz) being assigned to the values of the same attribute. In some cases, this problem can be approached by converting the corresponding quantities into a `standard' form.
6. Experiments
In this section, we report the results of experiments we performed to study the accuracy and performance characteristics of the proposed techniques. We used synthetic as well as real datasets in this study. Synthetic datasets are amenable to controlled experiments, while real datasets provide a sanity check.
6.1 Accuracy Metric
Given a database in which the correct correspondences between the numbers and the attribute names are known, and a query Q consisting of numbers with full information about the attribute names corresponding to those numbers, we can generate the ``perfect'' answer M(t) of top t matching documents for a given distance function F. If the information about attribute names is not available, or is only partially available in the form of hints, we would generate a different set M_P(t) of top t matching documents.
Precision(t) is defined to be the percentage of documents in M_P(t) that are also present in M(t):

$$\mathrm{Precision}(t) = \frac{|M_P(t) \cap M(t)|}{|M(t)|} \times 100 \qquad (13)$$

We set t to 10 in our experiments, corresponding to the number of hits typically displayed on the first page by search engines, and refer to Precision(10) as simply Precision.
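The metric is straightforward to compute; a minimal sketch, assuming M and M_P are ranked lists of document ids:

def precision(M, M_P, t=10):
    # Eq. 13: percentage of the top-t answers found without attribute
    # names (M_P) that also appear in the "perfect" top-t answers (M).
    return 100.0 * len(set(M[:t]) & set(M_P[:t])) / t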
6.2 Datasets
6.2.1 Synthetic Data
A document is modeled as a set of m numbers, corresponding to m known attributes.
We generated three types of synthetic data:
 Independent: Each attribute is independent of the other
attributes.
 Correlated: Attribute a_{i+1} is correlated
with attribute a_{i}.
 Clustered: Data is clustered around a few cluster centers.
Figure 5 shows the pseudocode for generating the three types of synthetic data. We use the offsets s[j] to control the amount of reflectivity. The value of s[j], which shifts the values of attribute j, is initially set in proportion to j · R, where R is a parameter that controls the amount of overlap between the ranges of values of the various attributes, which in turn influences reflectivity. If we wanted to generate a dataset with almost no overlap between any of the attributes, R would be set to a high value, e.g., 10. On the other hand, if we wanted very high overlap between the attributes, R would be set to 0. In our experiments, we manually tuned R such that the 1-dimensional nonreflectivity is roughly 80% for each of the three datasets (with the default number of attributes). The values in s[ ] are then randomly permuted, so that for the correlated dataset there is no connection between which attributes are correlated and which attributes are adjacent in terms of values.
Figure 5: Pseudocode for generating the synthetic datasets. (N: number of documents; m: number of attributes; D_{i}: document; D_{i}(j): value of attribute j in document D_{i}; R: controls the amount of overlap; s[ ]: per-attribute offsets, randomly permuted after initialization; Gauss(): Gaussian random variable.)
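The three generators can be sketched as follows; the offsets proportional to R match the description above, while the Gaussian parameters and the correlation and cluster mechanics are illustrative assumptions rather than the paper's exact procedure.

import random

def gen_data(kind, N, m, R, num_clusters=20):
    # s[j] spreads attribute j's range by a multiple of R; the offsets are
    # then randomly permuted so that, for the correlated dataset, which
    # attributes are correlated is unrelated to which ranges overlap.
    s = [j * R for j in range(m)]
    random.shuffle(s)
    if kind == "clustered":
        centers = [[random.gauss(0, 1) for _ in range(m)]
                   for _ in range(num_clusters)]
    docs = []
    for _ in range(N):
        if kind == "independent":
            dev = [random.gauss(0, 1) for _ in range(m)]
        elif kind == "correlated":       # attribute j+1 correlated with j
            dev = [random.gauss(0, 1)]
            for _ in range(m - 1):
                dev.append(0.7 * dev[-1] + 0.3 * random.gauss(0, 1))
        else:                            # clustered around a random center
            c = random.choice(centers)
            dev = [x + random.gauss(0, 0.2) for x in c]
        docs.append([s[j] + dev[j] for j in range(m)])
    return docs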
Figure 6 shows the settings used in the synthetic data experiments. The Range column gives the range of values used in various experiments. The Default column provides the default value of the corresponding parameter. Query size refers to the number of terms in the query.
To generate a query of size k, we first pick a document at random. Next, for Independent and Clustered, we randomly pick k attributes from the document, and for Correlated, we randomly pick a set of k consecutive attributes. We then remove this document from the dataset, and compute the precision. (If we kept the document in the dataset, our scores would be artificially boosted.) We average the results over 1000 queries for each combination of parameters and query size. We use L_1 to compute the distance between a query and a document in all our experiments.
6.2.2 Real Data
Figure 7 gives information about the nine real datasets used in our experiments. The first four datasets come from the Pangea data on electronic components, for different categories of electronic parts. The last five datasets are from the University of California, Irvine Repository of Machine Learning. A record here corresponds to a document.
The query generation procedure used with the real data is identical to the procedure we described for the Independent and Clustered data. All results are the average of 1000 queries.
6.3 Relationship of Precision to Reflectivity
Figure 8(a) shows the precision on the three synthetic datasets as the query size is increased. We compute answers to the same query under two settings: (i) attribute names are known for both data and query, and (ii) attribute names are ignored in data as well as query. We then use Eq. 13 to compute precision. Figure 8(b) plots nonreflectivity versus subspace dimensionality. Nonreflectivity is computed assuming attribute names are known in the data. Both precision and nonreflectivity have been plotted as a percentage. These figures show that precision at a given query size closely tracks nonreflectivity for the corresponding subspace dimensionality.
As anticipated, both Clustered and Correlated gave higher precision than Independent. Clustered gave higher precision than Correlated for higher query dimensions. To understand the behavior of Clustered with increasing subspace dimensionality, recall our example from Figure 2(b). This dataset has very low reflectivity in 2 dimensions, but high reflectivity in one dimension. Similarly, in our synthetic data, the clusters gradually merge as we drop dimensions, leading to an increase in reflectivity for lower dimensions. By the time we reach 2 dimensions, the clusters have completely disappeared and Clustered starts behaving like Correlated and Independent.
Figure 9(a) shows the precision as a function of the number of documents. Figure 10(a) shows the precision as a function of the number of data attributes. We again note from corresponding figures that the precision closely tracks nonreflectivity.
Figure 9: Varying the number of documents: (a) precision, (b) reflectivity.
Figure 10: Varying the number of attributes: (a) precision, (b) reflectivity.
6.4 Effectiveness of Indexing
We now study the effectiveness of indexing in limiting the set of documents for which we have to compute the distance for a query. All our experiments were run on a 933 MHz Pentium III with 512 MB of memory. The code was written in Java and run with Sun JDK 1.3 Hotspot Server. The execution times will be faster with C++, but the relative impact of various factors is likely to be the same. Both index and documents were kept completely in memory, limiting us to around 800,000 documents.
Figure 11 shows the execution times with and without indexing for the Independent, Correlated and Clustered datasets. The time without indexing was almost identical for the three datasets, and hence we only plot one curve, ``Scan'' for the no indexing case.
Figure 11: Execution times with and without indexing. (a) Varying query size (#documents = 10K, #attributes = 20); (b) varying number of documents (query size = 5, #attributes = 20); (c) varying number of attributes (query size = 5, #documents = 10K).
Figure 11(a) shows the execution time results for different query sizes for the three synthetic datasets. The index is quite effective for smaller query sizes, which we expect to be the norm. For larger query sizes, the dimensionality curse begins to catch up and using the index only does a little better than a full scan.
Figure 11(b) shows the scalability of the indexing as we increase the number of documents from 1000 to almost 1 million. At query size 5, the number of documents matched goes up by a factor of about 250 times as we increase the number of documents by 800 times, and the number of index entries scanned goes up around 200 times. Hence while the fraction of the index scanned and the percentage of documents checked both decrease slightly, the absolute time goes up almost linearly with the number of documents. At 800,000 documents, we take slightly more than 1 second for query size 5 and around 0.03 seconds for query size 2.
Figure 11(c) shows that the execution time remains flat as we increase the number of attributes. In our synthetic datasets, adding new attributes does not change the average amount of overlap between attributes. Hence the number of documents matched or index entries scanned is not affected by adding more attributes. However, matching takes a little longer, and hence the overall time goes up slightly.
Figure 12 shows results on three of the real datasets. For DRAM, there were often a large number of documents at a distance of zero from the query Q for smaller query sizes. Hence whether we stop as soon as we get 10 documents with zero distance, or whether we get all documents at zero distance (and then present a random sample), makes a dramatic difference in the running time. The ``Any 10'' line shows the former case, and the ``All Closest'' line the latter. For the other two datasets, this is not a significant issue, and hence these two lines are very close to each other. Again, we can conclude that indexing is quite effective for smaller query sizes.
6.5 Precision on Real Data
Figure 13 shows the precision and nonreflectivity (as percentages) in the same graph for the nine real datasets. The x-axis shows query size for precision and subspace dimensionality for nonreflectivity. We again see that nonreflectivity can provide a good estimate of the expected precision. We also see that real datasets can be quite nonreflective, and hence our techniques can be effectively used on them.
Figure 13: Precision and nonreflectivity for the nine real datasets: (a) DRAM, (b) LCD, (c) Microprocessor, (d) Transistor, (e) Automobile, (f) Credit, (g) Glass, (h) Housing, (i) Wine.
Effect of Clustering and Correlation
In Section 3, we stated our intuition that clustering
and correlation would typically increase nonreflectivity. To verify
this conjecture, we study
the effect of destroying any clustering or correlation effects in a
dataset by randomly permuting the values for every attribute.
(We permute the values of an attribute across documents, not across
different attributes.)
If the attributes were originally independent, this permutation will not affect
reflectivity, since permutation does not affect the distribution of
values for
each attribute. Hence the difference in nonreflectivity between the
original data and the randomized data can be attributed to clustering
and correlation. The line ``Randomized Non-Refl.'' shows the nonreflectivity on the randomized datasets. For all but the Credit dataset, randomized nonreflectivity is substantially lower than the original nonreflectivity.
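The randomization amounts to an independent shuffle of each attribute's column; a minimal sketch, assuming documents are lists of attribute values:

import random

def destroy_structure(docs):
    # Permute each attribute's values across documents. Each attribute's
    # value distribution is unchanged, but clustering and correlation
    # between attributes are destroyed.
    cols = [list(col) for col in zip(*docs)]
    for col in cols:
        random.shuffle(col)
    return [list(row) for row in zip(*cols)]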
6.6 Using Attribute Names as Hints
We generated data for this set of experiments by starting with the real datasets and then augmenting them with hints. Our procedure for generating hints takes two parameters: AvgNumHints and ProbTrue. AvgNumHints is the average number of hints per data value. Each data value is assigned at least one hint, hence the minimum value of AvgNumHints is 1. If AvgNumHints is greater than 1, then the number of hints for a specific data value is determined using a Poisson distribution. ProbTrue is the probability that the set of hints for a data value includes the true attribute name. Figure 14 gives the pseudocode for generating hints for a given data value. In general, increasing AvgNumHints should result in increasing ProbTrue. However, since the exact relationship is data and data extractor dependent, in any given experiment, we fix one of the values and vary the other. The query consists of numbers together with the corresponding true attribute name.
Figure 14: Pseudocode for generating hints for a given data value. (ProbTrue: probability of the true attribute name being in the set of hints.)
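The hint generator can be sketched from this description; the Poisson sampling method and the way wrong names are drawn are illustrative assumptions, not the paper's exact procedure.

import math
import random

def poisson(lam):
    # Knuth's inverse-transform sampler for a Poisson variate.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def gen_hints(true_name, all_names, prob_true, avg_num_hints):
    # Every value gets at least one hint; with probability prob_true the
    # true attribute name is among them; remaining slots are filled with
    # distinct wrong names.
    n_hints = 1 + (poisson(avg_num_hints - 1) if avg_num_hints > 1 else 0)
    hints = {true_name} if random.random() < prob_true else set()
    wrong = [a for a in all_names if a != true_name]
    random.shuffle(wrong)
    while len(hints) < n_hints and wrong:
        hints.add(wrong.pop())
    return hints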
We synthesized nine datasets from the real datasets described earlier, and experimented with all of them. However, we show the graphs for only three representative datasets: DRAM, Auto and Wine. DRAM and Auto have high nonreflectivity, and Wine lower nonreflectivity.
Weighting the Match on Hints
Figure 15 shows the effect of changing the
weight given to the match on the attribute names (the parameter
B in Eq. 10). AvgNumHints was set to 1, so
that every data value had exactly one hint assigned to it.
The different curves show precision for different values of ProbTrue.
If the hints are accurate (high value of ProbTrue), then precision
rapidly goes up with increasing B. For DRAM, the precision goes up
for values of B much smaller than 0.01, and hence
Figure 15(a) does not show this increase.
For queries on DRAM, there are often different attributes with the same
numeric value as the query value, and
even a small value of B is sufficient to pick between them.
If the hints are not very accurate, precision goes down with increasing B. One should not conclude that B should be infinite if hints are accurate and 0 otherwise. For example, in the case of Wine with ProbTrue between 0.6 and 0.8, the precision initially goes up with B and then drops back down. Therefore it is important to choose a good value of B.
In the rest of the experiments, we assume we have a tune set of 100 queries (per query size) for which we know the best match. We compute Precision over this tune set for different values of B (specifically, we used {0.01, 0.03, 0.1, 0.3, 1, 3, 10}). We then choose the value that gives the highest precision. As we mentioned in Section 5, this method can be used as long as the website provides enough summary with each answer that the user is likely to click on relevant answers.
Figure 15: Effect of the weight B for different values of ProbTrue: (a) DRAM (query size 2), (b) Auto (query size 2), (c) Wine (query size 5).
Effectiveness of Hints
Figure 16 shows the gains in precision
for different values of ProbTrue when AvgNumHints is set to 1.
Notice that for datasets with high nonreflectivity (DRAM and Auto), the hints
have to be extremely accurate (ProbTrue > 0.9) to add significant value.
However, for datasets with low nonreflectivity (Wine), even hints with
lower levels of accuracy can increase precision.
Number of Hints
In the previous set of experiments, we found that ProbTrue (the
probability that the set of hints will include the true attribute name)
had to be
quite high to increase precision. In many domains, such high values of
ProbTrue may not be achievable without
having multiple hints per attribute, where only one of the hints is correct.
Figure 17 shows precision for different values of
the average number of hints (AvgNumHints). For each line, we have fixed
the value of ProbTrue.
As expected, for a fixed value of ProbTrue, precision decreases as the number of hints increases. However, we expect ProbTrue to increase with the number of hints. Consider Figure 17(c), and assume we have a data extractor that gives a ProbTrue of 0.7 when AvgNumHints = 1, a ProbTrue of 0.9 when AvgNumHints = 1.5, and a ProbTrue of 0.99 when AvgNumHints = 3. In this scenario, choosing AvgNumHints = 3 will result in the highest precision. Thus, having multiple hints can improve precision if there is an accompanying increase in ProbTrue.
Estimating Reflectivity using Hints
Finally, we explore how well we can estimate nonreflectivity
when we do not know true attributes but only have
hints about them. We generate data with AvgNumHints = 1 and different
values of
ProbTrue, treat each hint as if it were the true attribute name, and
compute reflectivity. Figure 18 shows the results.
The values on the y-axis, with ProbTrue = 1, are the true values of
nonreflectivity. The nonreflectivity estimates are
lower with hints than with true attribute names. However, for small subspace dimensions, the drop-off is sufficiently gradual that the estimate is still useful when the hints are reasonably accurate. For larger subspace dimensions, the drop-off is sometimes quite steep. Hence if the estimate for nonreflectivity has a low value, we cannot be sure whether the true value of nonreflectivity is indeed low or whether the low estimate is an artifact of the inaccuracy of the hints.
But if we get a high value, we
can be confident that the true value of nonreflectivity is also high.
7. Conclusions
Many web text repositories
consist of documents in which it is difficult to establish exact
correspondences between attribute names and their numeric values.
To enable nearest-neighbor queries over such repositories, we explored a
rather audacious approach of using only sets of numbers in the search,
ignoring any attribute names. We showed that matching a document with a
query in this setting corresponds to the weighted bipartite graph
matching problem. We also showed how to limit the
number of documents that we have to match with a given query.
We identified a data property, called reflectivity, that can tell a priori how well our approach will work against a dataset. High nonreflectivity in data assures that our techniques will have high precision. Our experiments showed that real datasets can exhibit high nonreflectivity and our techniques yielded high precision against these datasets. Using synthetic data, we also showed the scalability and resilience of our techniques in terms of the number of documents, number of attributes in the data, size of queries, and types of data.
We also showed how we can use imprecise attribute names as hints to improve precision. Hints are particularly helpful when the data has low nonreflectivity. However, for datasets with high nonreflectivity, hints have to be extremely accurate to improve precision beyond what we get by matching numbers.
In the future, we plan to extend this work in three directions. First, we would like to handle queries in which value ranges are provided. Second, some extraction procedures associate confidence values with the attribute names assigned to a number. We wish to explore how to take advantage of this additional information. Finally, we would like to better understand the interaction between correlated and clustered data and reflectivity.
Acknowledgment: We wish to thank Ron Fagin for discussions on the threshold algorithm.
Bibliography
[1] S. Berchtold, C. Bohm, D. A. Keim, F. Krebs, and H.-P. Kriegel. On optimizing nearest neighbor queries in high-dimensional data spaces. In Proc. of the 8th International Conference on Database Theory, pages 435-449, London, January 2001.
[2] A. Broder and M. Henzinger. Information retrieval on the web: Tools and algorithmic issues. In Foundations of Computer Science, Invited Tutorial, 1998.
[3] S. Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD Explorations: Newsletter of the Special Interest Group on Knowledge Discovery & Data Mining, 1, 2000.
[4] B. Cherkassky, A. Goldberg, P. Martin, J. Setubal, and J. Stolfi. Augment or push? A computational study of bipartite matching and unit capacity flow algorithms. Technical Report 97-127, NEC Research Institute, August 1997.
[5] J. Cho and S. Rajagopalan. A fast regular expression indexing engine. In Proc. of the Int'l Conference on Data Engineering, San Jose, California, Feb. 2002.
[6] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-138, June 1979.
[7] A. Crespo, J. Jannink, E. Neuhold, M. Rys, and R. Studer. A survey of semi-automatic extraction and transformation. http://www-db.stanford.edu/~crespo/publications/.
[8] U. Dayal and H.-Y. Hwang. View definition and generalization for database integration in a multidatabase system. IEEE Trans. Software Eng., 10(6):628-645, 1984.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[10] C. Dwork, M. Naor, R. Kumar, and D. Sivakumar. Rank aggregation methods for the web. In Proc. of the 10th World-Wide Web Conference (WWW10), pages 613-622, Hong Kong, May 2001.
[11] R. Fagin. Combining fuzzy information from multiple systems (extended abstract). In Symposium on Principles of Database Systems, 1996.
[12] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Symposium on Principles of Database Systems, 2001.
[13] T. Feder and R. Motwani. Clique partitions, graph compression and speeding-up algorithms. Journal of Computer and System Sciences, 51:261-272, 1995.
[14] U. Guntzer, W.-T. Balke, and W. Kiessling. Optimizing multi-feature queries for image databases. In Proc. of the 26th Int'l Conf. on Very Large Databases, Cairo, Egypt, 2000.
[15] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In ACM Symposium on Theory of Computing, pages 604-613, 1998.
[16] V. Kashyap and A. P. Sheth. Semantic and schematic similarities between database objects: A context-based approach. VLDB Journal, 5(4):276-304, 1996.
[17] W. Kim and J. Seo. Classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer, 24(12):12-18, 1991.
[18] Z. Kopal, editor. Physics and Astronomy of the Moon. Academic Press, 1962.
[19] I. Muslea. Extraction patterns for information extraction tasks: A survey. In The AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
[20] S. Nepal and M. V. Ramakrishna. Query processing issues in image (multimedia) databases. In Proc. of the 15th Int'l Conf. on Data Engineering, pages 22-29, 1999.
[21] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of the 1995 ACM SIGMOD Int'l Conf. on Management of Data, pages 71-79, 1995.
[22] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, New York, 1989.
Footnotes
1. The specific extraction techniques used are not of concern in this paper. One could, for example, use a regular-expression-based extractor [5].
2. For a given source vertex, we only need to create edges to the k closest targets, since the other k - 1 source vertices can match at most k - 1 targets.