Combining Classifiers to Identify Online Databases
- Luciano Barbosa (University of Utah)
- Juliana Freire (University of Utah)
We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, including meta-searchers, database selection tools, hidden-Web crawlers, form-schema matching and merging, and the construction of online database directories. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers, each constructed with the learning technique best suited to its partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.
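As an illustration only, the idea of composing simple classifiers over separate partitions of the form features can be sketched as follows. Everything in this sketch is an assumption made for the example: the `WebForm` fields, the split into structural and textual features, the hand-written rules standing in for learned classifiers, and the keyword list for a hypothetical airfare domain. It is not the method evaluated in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class WebForm:
    # Hypothetical feature set; the abstract does not enumerate the features.
    num_text_inputs: int
    num_selects: int
    has_password_field: bool
    text: str  # visible text in and around the form

Classifier = Callable[[WebForm], bool]

def structural_classifier(form: WebForm) -> bool:
    """Stand-in for a classifier over the structural feature partition.

    Assumed rule: a form is a database entry point if it has at least
    one input control and no password field.
    """
    has_inputs = (form.num_text_inputs + form.num_selects) >= 1
    return has_inputs and not form.has_password_field

def make_keyword_classifier(keywords: Iterable[str], min_hits: int = 2) -> Classifier:
    """Stand-in for a classifier over the textual feature partition.

    Accepts a form when at least `min_hits` distinct domain keywords
    appear in its text.
    """
    vocab = {k.lower() for k in keywords}

    def classify(form: WebForm) -> bool:
        tokens = set(form.text.lower().split())
        return len(tokens & vocab) >= min_hits

    return classify

def compose(*classifiers: Classifier) -> Classifier:
    """Accept a form only if every component classifier accepts it."""
    def classify(form: WebForm) -> bool:
        return all(c(form) for c in classifiers)
    return classify

def select_forms(forms: List[WebForm], classifier: Classifier) -> List[WebForm]:
    """Select from F the forms judged to be entry points to domain D."""
    return [f for f in forms if classifier(f)]

# Example: F holds three forms, D is a hypothetical airfare domain.
forms = [
    WebForm(2, 0, True, "username password sign in"),
    WebForm(2, 3, False, "departure city arrival city flight date"),
    WebForm(1, 1, False, "title author isbn"),
]
airfare = compose(
    structural_classifier,
    make_keyword_classifier(["flight", "departure", "arrival", "airline"]),
)
selected = select_forms(forms, airfare)
```

Because each component sees only its own partition of the features, either one could be replaced by a learned model without changing the composition.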