On Integrating Catalogs
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120, USA
Copyright is held by the author/owner(s).
WWW10, May 1-5, 2001, Hong Kong.
Note: We strongly recommend printing the PDF version of this paper.
Keywords: Classification, Categorization, Data Mining, Catalog Integration, Web Portals, Web Marketplaces
Imagine you are a marketplace for electronic components. Your catalog features nearly ten million parts categorized into 5000 categories. Noticing your success, a major distributor wants to join your marketplace. This distributor's catalog contains nearly a million parts categorized into 2000 categories. Your problem is how to quickly integrate this distributor's catalog into your catalog.
This problem is pervasive on the web, given many websites are aggregators of information from various sources. B2C shops like Amazon need to integrate catalogs from multiple vendors. B2B portals, Chipcenter and Questlink, each with a large catalog of their own, recently merged to form eChips. Information portals like Yahoo! and Google categorize documents into categories. One can easily conceive a web service that combines the two portals. Many corporate portals are now merging intra-company and external information into a uniform categorization.
This paper presents a new technique to help automate the task of catalog integration. Let us call your marketplace MrCurrent and the distributor who wants to join you MrNew. A straightforward approach to attacking the catalog integration problem would be to formulate it as a classification problem [Mit97]. Take the MrCurrent catalog, treat each category in the catalog as a class, and the information about the products belonging to each category as training examples for the corresponding class. We thus have a well-posed classification problem and a training set and we can build a predictive model for classifying MrNew's products into MrCurrent's categories.
Notice, however, in this straightforward approach, we completely ignored MrNew's categorization. On the other hand, MrNew's categorization contains valuable implicit information about product similarity. Suppose the classifier's prediction is such that 98% of the products belonging to some category in MrNew's catalog fall in one category in MrCurrent's catalog and 2% in a different category. Those 2% predictions are quite possibly errors, but it will also be a mistake to take a winner-takes-all attitude and assume that those 2% are necessarily errors.
Our key contribution is how to incorporate the implicit information contained in MrNew's catalog into the classification process. We show that this additional information can substantially boost classification accuracy.
We now formally define the catalog integration problem we are solving.
A document d is an object consisting of i) a set of words, and/or ii) a set of attribute value pairs. By including attribute value pairs, we are able to cover both text documents as well as product descriptions.
A catalog is a partitioning of a set of documents into a set of categories1. We are given two catalogs:
- A master catalog M with a set of categories and a set of documents in each category.
- A source catalog N with a set of categories and another set of documents.
We optionally identify documents in N that do not fit well in M and give the user the option of adding these as new categories in M.
An assumption underlying our model is that the categorizations used by catalogs M and N are homogenous and have significant overlap. It is possible that the categorizations used by M and N may be completely orthogonal to each other. For instance, consider a corpus of documents describing businesses. The categorization in M is by business type, whereas the categorization in N is by geographical location. In such cases, the implicit information in N will not help us better categorize the documents into M.
Vocabulary changes are a problem for classification in general. For instance, it was observed in [Lar99] that patents about similar inventions can contain very different terminology. If the vocabulary in N is quite different from M, the classification accuracy will certainly be affected. However, this problem is orthogonal to the idea of using the similarity information in N.
Our model flattens the catalog hierarchy and treats it as a set of categories. Past studies [DC00] [CDAR97] have shown that exploiting the hierarchical structure can lead to better classification results than using the flattened structure. Our enhancements for using the information in N can be easily incorporated in a Naive Bayes hierarchical classifier, such as one used in [CDAR97]. Incorporation of these enhancements into other classification schemes, such as the SVM classifier used in [DC00], requires further work.
Another issue related to hierarchies is that the hierarchy in M may be more detailed than N or vice-versa. If M is more detailed than N, our technique can still be helpful. For example, if N has a category ``Cars'', while M has ``Sports Cars'' and ``Sedans'', our technique will not help distinguish between ``Sports Cars'' and ``Sedans''. However, it will help distinguish between these two categories and say ``Trucks''. On the other hand, if N is more detailed than M, the detailed categories in N can be first merged into a super category and our technique can then be applied. (While our technique would be effective even if applied without merging the detailed categories, we show in Section 5.2 that having more documents in the categories in N leads to better results.)
Inducing classification model for a set of categories given examples of objects for each category is a much studied topic in the statistics, machine learning, and data mining literature. See, for instance, [Mit97] for a comprehensive review of various classification techniques. Naive-Bayes classifiers [Goo65] are competitive with other techniques in accuracy [CDAR97] [LR94] [Lan95] [PB97] [MN98]. They are also fast: building the model requires only a single pass over the documents and they quickly classify new documents. Our proposed solution is also Bayesian.
The observation that the classification techniques can be used to assign documents to a hierarchy has been previously made in connection with folder systems. Proposals on the development of classification models for the purpose of routing e-mail include [ABS00] [Coh96] [SDHH98] [SK99]. Other systems provide agents that assist e-mail users by predicting an action the user is likely to take [Mae94] [PE97]. SONIA [SYB98] uses agglomerative text clustering to organize the results of queries to networked information sources, and Naive Bayes classification to organize new documents within an existing categorization scheme. Text classification has also been applied in other domains, e.g. [DPBA99] showed how well SEC (Security Exchange Commission) filings can be classified into SIC categories. None of these systems address the task of merging hierarchies.
The Athena system [ABS00] includes the facility of reorganizing a folder hierarchy into a new hierarchy. The user provides examples of documents for every node of the new hierarchy, which is used as the training set for learning the classification model for new hierarchy. The documents from old hierarchy are then routed to the new hierarchy using this model. But no information from the old hierarchy is used in either building the model or routing the documents.
2, we review Naive Bayes classification and give the basic algorithm that applies this technique in a straightforward manner to merge documents from a source catalog into an existing catalog. In Section 3, we present our enhanced algorithm that uses the implicit information in the source catalog to improve the accuracy of catalog integration. We present an analysis in Section 4 that shows why using implicit information is a win. We present an empirical evaluation in Section 5 that gives improvements in accuracy realized in our experiments. We conclude with a summary in Section 6.
We start with a quick review of Naive Bayes classification and then give the basic algorithm that applies this technique for merging catalogs in a straightforward manner.
The Naive Bayes classifier estimates the posterior probability of
category Ci given a document d via Bayes' rule
We can ignore since it is the same for all categories and we need only the relative probability of the categories to determine d's category assignment. is estimated by
which is easy to compute.
We treat as an input in our enhanced algorithm. We review here how to compute only for the case of text, and refer readers to [Mit97] for extensions to the case when the document also contains a set of attribute-value pairs.2
To estimate the first term on the right hand side of Eq. 3,
assume that the words in d are independent of each other. We get
where t represents the words (tokens). To estimate , we compute the frequency of occurrence of every word appearing in any of the textual descriptions of the set of documents in category Ci. Let n(Ci,t) be the number of occurrences of word t in documents in category Ci(counting multiple occurrences), and the total number of words in documents in category Ci. Then the maximum likelihood estimate for is simply n(Ci,t) / n(Ci). However, using this estimate would give a probability of zero for any word that does not occur in any of the documents in the category, and thus result in being zero for any document dthat contained a word not present in category Ci. Following [ABS00], we address this problem by using Lidstone's law of succession to smooth the maximum likelihood estimate. For , we estimate to be
where |V| is the size of the vocabulary (i.e., the number of distinct words in the textual descriptions in all of the documents). The above estimate is a linear interpolation of the maximum likelihood estimate n(Ci,t)/n(Ci) and the uniform prior 1/|V|. The optimal value of is computed by using a randomly selected subset of documents in M as a validation set, i.e., we build the classification model using the rest of M, and compute the accuracy of the model for different values of on the validation set.
Figure 1 presents the basic method that uses the standard Naive Bayes technique. We first build a classification model using the set of documents already in M. This classification model is then used to place documents from N into M.
Note that depending on policy parameters:
- A document d may be assigned to more than one category (say Ciand Cj) if and both have high values.
- If for some document d, the value of is low for all the categories, d may be kept aside for manual classification, possibly into a new category of M.
- If for some category S in N, a large fraction of the documents satisfy the previous condition, S may be flagged as a candidate for becoming a new category of M.
Our proposed method uses the similarity information implicit in the categorization of documents in N to build more accurate classification models. The intuition is that if two documents belong to the same category in N, they are more likely to belong to the same category in M.
Let S denote a category in N. We can extend Bayes rule
from Eq. 1 to incorporate the implicit information in
S. We will use the standard notation
to refer to
We also use
to refer to
to refer to
The posterior probability
of category Ci
in M given a document d belonging to category S in N is computed as:
Eq. 5 is similar to Eq. 1, except that we have instead of . (We also have instead of in the denominator, but since this term is the same for all classes, it does not affect relative probabilities.)
we first classify the documents using
the basic algorithm, then use the categories of the
documents in S to compute the estimate. A simple estimate could be:
However, while this equation is identical in form to the estimate for (Eq. 2), that equation was based on real frequencies, while this equation is based on estimated frequencies. The accuracy of the estimate depends on the accuracy of the classifier.
As an illustration, consider a scenario where we know that the source
catalog categories are identical to the master catalog categories.
With a perfect classifier, the above estimate will be 1 for the true
category and 0 for all other categories. With a classifier that is 90%
accurate, the estimate will be .9 for the true category, and perhaps .1/kfor k other categories assuming the errors are evenly split among
them. In this case, we would like
to be 1 for
the category with the most number of documents, and 0 for the other
categories, i.e., use the majority rule. More generally, we can use an
that reflects the similarity between the
categorization of the two catalogs to decide the amount of weight to
give to the implicit information. We would also like to smooth our
estimate using the statistics for
from M, and
have the property that for w = 0, our enhanced classifier defaults to
the standard classifier. A formula that satisfies these goals is:
where |Ci| is the number of documents in category Ci in M. For w = 0, , where Bi is 1 if at least one of the documents in Swas predicted to be in category Ci, and 0 otherwise. Notice that even though the absolute probabilities are different, the relative probabilities among the set of classes that the standard classifier predicted for any of the documents in S is the same. Hence the prediction of the enhanced classifier will be the same as the prediction of the standard classifier when w = 0.
Figure 2 describes the enhanced algorithm, for a given weight w.
1. For each category Ci in M, compute and (Eqs. 2 and 4, respectively). 2. For each category S in N: (a) For each document d in S: (i) Compute for each category Ci in M using the basic algorithm (Figure 1). (ii) Tentatively assign d to the category with the highest value for . (b) Use the results of Step 2(a)(ii) and Eq. 6 to compute . (c) Re-classify each document in S using instead of in the classification model (Eq. 5).
Before describing the method for selecting weight w, we first motivate the need for selecting a good value for w.
Example Consider a master catalog M in which there are separate categories for ``Digital Cameras'' and ``Computer Peripherals''. When we integrate products from the source catalog N, let the basic algorithm come up with the following probabilities for the five products in one of the categories in N.
Based on these probabilities, four out of the five products belong to Camera and one to Peripherals. In the master catalog, let there be 10 products each of these two categories, and none in any other category (to simplify the example). With a weight of 1, = 4*10/(4*10+1*10) = 0.8 and = 1*10/(4*10 +1*10) = 0.2. After incorporating these probabilities (and normalizing), the enhanced algorithm would assign:
Peripherals Camera P1 .1 .9 P2 .2 .8 P3 .2 .8 P4 .2 .8 P5 .9 .1
With a weight of 2, = 16*10/(16*10+1*10) = 0.94 and = 1*10/(16*10+1*10) = 0.06. This would result in:
Peripherals Camera P1 .03 .97 P2 .06 .94 P3 .06 .94 P4 .06 .94 P5 .69 .31
Thus the classification of P5 does not change with a weight of 1, but switches to the majority category with a weight of 2.
Peripherals Camera P1 .01 .99 P2 .02 .98 P3 .02 .98 P4 .02 .98 P5 .36 .64
To determine a good value for the weight, we need a tune set of documents in N for which we know the correct categorization with respect to M. If there are some common documents between M and N, we can use these common documents as the tune set. If there are no common documents between the two catalogs, the creation of the tune set requires user interaction. We select a random subset of the documents in N, and present these to the user to get their categorization in M.
We first make one pass through step 2(a) in Figure 2 for all the documents in N, allowing us to compute for a given weight w. Next, for each document in the tune set, for a set of values for w, we go through steps 2(b) and 2(c) in Figure 2 and determine the values of w for which the document is correctly classified. We typically use an exponentially increasing series for the set of possible values of w, e.g. (0, 1, 3, 10, 30, 100, 300, 1000). We then select the weight that gives the highest accuracy on the documents in the tune set. If a set of weights give the same accuracy, we break the tie by choosing the smallest weight, and thus not overweight the similarity implied by the hierarchy N.
We can reduce the number of documents the user has to inspect by making two passes. After selecting a random subset of documents in N, we make a first pass through Step 2 for those documents, and discard those which have the same categorization for all the weights. Since the categorization does not change, knowing the true category for these documents will not help us choose the weight. We then present the remaining documents to the user, and choose the weight that gives the highest accuracy on these documents. We empirically found in Section 5.2 that by following this strategy, feedback from the user on just 5 to 10 documents was sufficient to get a near-optimal value for the weight.
Depending on the cost of getting a tune set and the size of the catalog, we either choose a global weight for the entire catalog, or tune weights differently for different sections of the catalog.
We first study the behavior of the enhanced algorithm with respect to weight w. We then show that with a good choice for the value of w, the enhanced algorithm will do no worse, and can do substantially better than the basic algorithm.
Consider the documents belonging to a category S in N.
Our first theorem shows that as we increase the weight w, the
predicted category of a document in S will either stay the same or
switch from a less frequent
category to a more frequent category, where frequency of a category is
the number of documents in S predicted to be in that category. Denote
Let represent the categories ordered by their frequency in the predicted categories of the documents in S, i.e., , Figure 3 shows the movement of documents with increasing weight. We can split the documents into three classes:
- Benefit: Those to the right of the X's that can move to the
- At Risk: Those currently in the X's that might move to the
left of the X's.
- Lost Cause: Those to the left of the X's.
The catch is that the optimum value of the weight for which the enhanced algorithm achieves highest accuracy is data dependent. The method using a tune set described in the previous section attempts to select a good value for the weight, but there is no guarantee of success. Our experimental results in the next section show that we were able to get substantial accuracy improvements using this method.
We present experimental results using two types of datasets:
Synthetic Catalogs Start with a real-world catalog and consider it to be the master catalog M. Derive source catalogs N from M using various distributions. Consider a specific category Cm in M. Some of the documents from Cmare assigned to the corresponding category in N, while some are spread over other categories. Use a distribution to determine how many documents are distributed over how many categories. For example, a simple 80-20 distribution will send 80% of the documents to the corresponding category and 20% to some other category. Different distributions result in different degrees of overlap between M and N. We use a variety ranging from a ``perfect'' match between the two catalogs, to a ``Gaussian'' distribution where there is only a moderate amount of similarity. We generate synthetic catalogs from three master catalogs: a collection of news articles from Reuters, the Pangea electronics part descriptions dataset, and a slice of the Google hierarchy.
Now consider a specific document d from Cm in M that the synthetic data generation assigns to the category Sn in N. While integrating N into M, if d is assigned a category other than Cm, we count it as a classification error; otherwise the assignment is correct.
Real Catalogs Start with two real-world catalogs that have some common documents. Treat the first catalog minus the common documents as the master catalog M. Remove from the second catalog all documents except the common ones. Treat the remaining documents in the second catalog as the source catalog N. We thus know for each document in N the correct classification in M. The goal for the catalog integration algorithm now is to successfully predict for each document in N its correct category in M. We do these experiments using Google and Yahoo! categories.
Experiments Our goal is to understand the following performance characteristics:
- Accuracy advantage of the enhanced algorithm over the basic algorithm.
- The effect of weight on the accuracy of the enhanced algorithm.
- The size of tune set needed for the enhanced algorithm.
- The effect of the number of documents per category in the source catalog on the classification accuracy.
We use the classification
accuracy as the metric for measuring the success of our approach.
Accuracy is defined as
In the rest of this section, we describe in detail the datasets and the experimental results. Section 5.1 discusses the synthetic datasets and Section 5.2 the results with these datasets. Section 5.3 discusses the real datasets, and Section 5.4 reports results with these datasets.
Master Catalogs We used the following master catalogs for generating synthetic catalogs:
- Reuters: This is a collection of news articles, and a popular benchmark in the information retrieval community. We used the version of Distribution 1.0 of Reuters-21578 in which each document is assigned a single category (available from http://www.research.att.com/ lewis).
- Pangea: This is a dataset of electronic component descriptions used in building the experimental B2B Pangea portal at IBM Almaden.
- Google.Outdoors: This is the set of web pages in the ``Recreation/Outdoors'' category in Google (available from http:// directory.google.com/Top/Recreation/Outdoors/). For each category in ``Outdoors'', we got the links both on the category page, and from each of the subcategories, and merged the set of links.
Figure 4 shows the number of categories and total number of documents for these catalogs.
Figure 4: Dataset Characteristics: Reuters, Pangea & Google.Outdoors
Target Distributions for Synthetic Catalogs
Our synthetic catalogs have as many categories as the master catalog. We spread documents from a certain category in the master catalog to multiple categories in the synthetic catalog. Figure 5 shows the various target distributions we used for spreading documents. For the Perfect distribution, each category in the synthetic catalog will have documents from a single category in the master catalog. On the other hand, for GaussianB distribution, documents from a category in the master catalog will be spread over 7 categories in the synthetic catalog, and each category in the synthetic catalog will include documents from 7 categories in the master catalog. The intent was to evaluate the performance of our approach to catalog merging over a wide range of similarity characteristics between the categorizations of the two catalogs.
|GaussianA||68.2, 27.2, 4.3, 0.3|
|GaussianB||38.3, 29.9, 18.4, 8.8, 3.4, 0.9, 0.3|
Figure 5: Target Distributions for Synthetic Catalogs
Figure 6 gives the details of the algorithm for generating N.
// n: number of categories in M // : cumulative frequency distribution // f0 := 0 (to simplify the description) for i := 1 to n begin foreach document d in Ci begin Let r := uniformly distributed random number in [0, 1). Find p such that . j := ; Assign d to category Sj. end end
Figure 7 shows the actual distributions for the synthetic catalogs. These are slightly different from the target distributions because of the skew in category sizes of the master catalogs.
|90-10||91.2, 8.8||90.6, 9.4||91.1, 8.9|
|80-20||82.0, 18.0||80.9, 19.1||82.6, 17.4|
|GaussianA||70.8, 25.5, 3.5, 0.2||69.2, 27.1, 3.5, 0.1||68.5, 28.0, 3.4, 0.1|
|GaussianB||51.2, 27.5, 11.8, 6.5, 2.1, 0.9||45.2, 28.4, 15.8, 8.1, 2.2, 0.3||42.2, 28.8, 18.9, 7.2, 2.5, 0.3|
8 shows that the enhanced algorithm can substantially outperform the basic algorithm. The graphs show the accuracy for each of the source catalogs as we change the weight; the ``Base'' line is the accuracy of the basic algorithm. The accuracy numbers were computed using 10-fold cross-validation, i.e., we randomly partition the data into 10 sets, and for each set, train on the remaining 9 sets, and test on this set [BFOS84]. With a ``Perfect'' second catalog, the absolute increase in accuracy (at the best weight) is 12% for Reuters, 20% for Pangea and 29% for Google; the number of errors drops by 85% for Reuters, 73% for Pangea, and 65% for Google. With a distribution like GaussianA that results in considerable mismatch between the source and master catalogs, we still get an absolute increase of 5% for Reuters, 12% for Pangea and 15% for Google; the number of errors drops by 39% for Reuters, 43% for Pangea and 33% for Google. Notice that for distributions (for generating source catalogs) that are not a good match to the main catalog, very high weights can cause the enhanced algorithm to under-perform the basic classifier. Of course, once we use a tune set, we will avoid those weights.
In general, the optimal value of weight increases with increasing similarity between the two hierarchies. The other factor is the size of the documents in the dataset: as documents become longer, the differences between for different categories becomes larger and hence a higher weight is needed to influence the results. The documents from the Google dataset are significantly longer than those from Reuters or Pangea; hence for the same distribution of the synthetic hierarchy, the optimal weight is higher for Google.
Figure 9 shows that very small tune sets are sufficient to choose the right weight. With just 5 carefully chosen examples, we are able to do almost as well as with 10 to 50 examples: there are only slight improvements beyond 10 examples. The reason why so few examples are sufficient is that changes in weights only change the classification of a small fraction (typically around 10%) of the total documents to be classified. By choosing only such documents to present for tuning, we get the same effect as we would by presenting 10 times as many random documents.10 shows the effect on accuracy as we increase the number of documents in the second catalog. The x-axis shows the weighted median category size, i.e., if the category size is 10 documents, half the documents are in categories (in the second catalog) with 10 or fewer documents, and half the documents are in categories with 10 more documents. (The median was averaged over the different test sets; hence it is not an integer.) We used a modified version of 5-fold cross-validation: the training set was always 80% of the data, but the test set ranged from 20% of the data to samples of that 20%. The weight was determined using a tune set of 10 documents (selected from the entire 20% in order not to affect the results). With small category sizes, our estimates of are likely to be off by quite a bit from the true value; hence it is not surprising that the accuracy increases with the category size. However, note that for Pangea and Google, we still get a substantial part of the accuracy boost even with median category sizes of just over 5 documents.
Figure 10: Accuracy vs. Median Size of Categories in Synthetic Catalog
To evaluate our approach using real catalogs for N, we wrapped the Google and Yahoo! websites, extracted over 100,000 links from 5 different sections of their categorizations, and retrieved the corresponding web pages.3 Incidently, we found less than 10% overlap between the links in Google and Yahoo!. Figure 11(a) shows the number of categories, number of links obtained from the website, and number of documents (after dropping the common documents and those documents our crawler could not get). When computing the set of common links, we ignored links that had multiple categories. Figure 11(b) shows similar numbers for the set of common documents. We include both the weighted median and the average documents per category, since the median can sometimes be much higher. This figure also shows the distributions of the categories. For instance, a single category in Google.Autos had, on average, 78% of documents from a single category in Yahoo.Autos, 9% from a second category in Yahoo.Autos, 5% from a third category, and so on. These distributions look similar to the Gaussian distributions we used in the synthetic catalogs, except that the tails are longer.
|(a) All Documents||(b) Common Documents|
|Google.Autos||26||8095||7351||23||223||9.7||52||78, 9, 5, 4, 3, 1, .4|
|Yahoo.Autos||59||7202||6021||36||223||6.2||49||76, 13, 5, 2, 1, .9, .9, .4, .2, .2|
|Google.Movies||40||7285||6483||32||130||4.1||8||69, 15, 6, 3, 3, 1, 1, .7, .7, .7|
|Yahoo.Movies||45||8531||7433||38||130||3.4||7||64, 20, 8, 4, 2, 1, .6, .2|
|Google.Outdoors||38||7707||7260||30||135||4.5||6||76, 16, 6, 1|
|Yahoo.Outdoors||68||8756||8162||37||135||3.6||7||80, 11, 3, 2, 1, .7, .7, .5, .2, .2|
|Google.Photography||26||3661||3233||22||148||6.7||15||67, 18, 6, 4, 2, 2, 2, .5|
|Yahoo.Photography||35||3575||3014||26||148||5.7||11||59, 17, 9, 5, 3, 2, 2, .9, .4, .4|
|Google.Software||77||20471||18864||55||189||3.4||8||69, 15, 7, 4, 3, .9, .3, .3, .3, .3|
|Yahoo.Software||46||12222||10673||37||189||5.1||7||50, 14, 8, 6, 4, 3, 2, 1, .9, .9|
Figure 11: Dataset Characteristics: Google & Yahoo! Slices (October 2000)
Figure 12 shows the performance of the algorithm on the 10 datasets given in Figure 11. We tested the accuracy of the classifiers when merging the Yahoo! categorization into the corresponding Google categorization and vice versa. The enhanced algorithm did very well even though the two catalogs were quite different: 30% fewer errors on average (14.1% difference in absolute accuracy) when classifying Yahoo! into Google, and 26% fewer errors on average (14.3% difference in absolute accuracy) when classifying Google into Yahoo!.
(a) Yahoo! to Google (Train: Google, Test: Yahoo!)
(b) Google to Yahoo! (Train: Yahoo!, Test: Google)
Figure 12: Google & Yahoo!: Classification Accuracy
We presented a technique for integrating documents from a source catalog into a master catalog that takes advantage of the implicit information present in the source catalog: that documents in the same category in the source catalog are similar. Our technique enhances the standard Naive Bayes classification by incorporating this information when classifying source documents. We showed through analysis that the highest accuracy achievable with our enhanced technique can be no worse than what can be achieved with the standard Naive Bayes classification.
Our experiments using synthetic as well as real data indicate that the proposed technique can result in large accuracy improvements. Using a tune set for selecting the weight to give the implicit information allows our enhanced algorithm to do substantially better, and never do significantly worse. Our experiments also showed that the size of the tune set can be very small and hence not place a significant burden on the user of the system.
Consider the documents belonging to a category S in N. Let
For a document d, let
From Eqs. 5 and 6, for a weight w
where K is the same for all the documents in S (K is the product of the denominators in the two equations). We use the notation to denote for a weight w.
. Expanding, we get
Similarly, since , we get
Combining equations 7 and 8, we get
Similarly for any w < wt:
Proof: Let Cp be the correct classification for document d. We split all other categories into five groups:
- . From Lemma 2, if Galways is non-empty, the document will never be correctly classified.
- . Cpwill always be preferred to any .
- . We assume that if two classes have the same probability, the document is assigned to both classes, and thus ignore this case.
- . From Lemma 3, there exists some weight whigh such that for all w < whigh, ,
- . From Lemma 3, there exists some weight wlow such that for all w > wlow, .
Rakesh Agrawal, Roberto Bayardo, and Ramakrishnan Srikant.
Athena: Mining-based Interactive Management of Text Databases.
In Proc. of the Seventh Int'l Conference on Extending Database Technology (EDBT), Konstanz, Germany, March 2000.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone.
Classification and Regression Trees.
Wadsworth, Belmont, 1984.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan.
Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases.
In Proc. of the 23rd Int'l Conf. on Very Large Databases, pages 446-455, 1997.
Learning Rules that Classify E-Mail.
In Proc. of the 1996 AAAI Spring Symposium on Machine Learning in Information Access, 1996.
S. T. Dumais and H. Chen.
Hierarchical classification of web content.
In Proc. of the 23rd Int'l ACM Conf. on Research and Development in Information Retrieval (SIGIR), pages 256-263, Athens, Greece, August 2000.
R. Dolin, J. Pierre, M. Butler, and R. Avedon.
Practical evaluation of IR within automated classification systems.
In Proc. of the 8th Int'l Conf. on Information and Knowledge Management (CIKM), Kansas City, Nov. 1999.
The Estimation of Probabilities: An Essay on Modern Bayesian Methods.
M.I.T. Press, 1965.
News Weeder: Learning to Filter Net-News.
In Proc. of the 12th Int'l Conf. on Machine Learning, pages 331-339, 1995.
Leah S. Larkey.
A patent search and classification system.
In The Fourth ACM Conference on Digital Libraries, pages 79-87, Berkeley, CA, August 99.
D.D. Lewis and M. Ringuette.
A comparison of two learning algorithms for text categorization.
In In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-92, 1994.
Agents that Reduce Work and Information Overload.
Communications of the ACM, 37(7):31-40, 1994.
Tom M. Mitchell.
Andrew McCallum and Kamal Nigam.
A Comparison of Event Models for Naive Bayes Text Classification.
In AAAI-98 Workshop on ``Learning for Text Categorization'', 1998.
M. Pazzani and D. Billsus.
Learning and Revising User Profiles: The identification of interesting web sites.
Machine Learning, 27:313-331, 1997.
T.R. Payne and P. Edwards.
Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent Interface.
Applied Artificial Intelligence, 11:1-32, 1997.
M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz.
A Bayesian Approach to Filtering Junk E-mail.
In Proc. of the AAAI'98 Workshop on Learning for Text Categorization, Madison, Wisconsin, 1998.
R. Segal and J. Kephart.
MailCat: An Intelligent Assistant for Organizing E-Mail.
In Proc. of the Third Int'l Conf. on Autonomous Agents, 1999.
M. Sahami, S. Yusufali, and M.Q.W. Baldonado.
Sonia: A service for organizing networked information autonomously.
In Proc. of the Third ACM Conference on Digital Libraries, pages 200-209, 1998.
- ... categories1
- Catalogs are often organized as hierarchies. We assume that any documents assigned to an interior node really belong to a conceptual leaf node that is a child of that interior node. Since we now have documents only at leaf nodes, we can flatten the hierarchy to a single level and treat it as a set of categories. Note that a document can still belong to multiple leaf nodes, but documents are only in leaf nodes and not interior nodes.
- ... pairs.2
- The essential idea is that if the document consists of both text and attributes, we can assume independence between the text and the attributes to compute , where dtext is the text for the document description, and dattr the set of attributes for the document.
- ... pages.3
- From Google (http://directory.google.com), we got Recreation/Autos, Arts/Movies, Recreation/Outdoors, Arts/Photography and Computers/Software. From Yahoo! (http://dir.yahoo.com), we got the corresponding categories: Recreation/Automotive, Entertainment/Movies_and_Film, Recreation/Outdoors, Arts/Visual_Arts/Photography, and Computers_and_Internet/Software.