WWW2007: Program
Top of Menu Home CFP Program Committees Key Dates Location Hotel Registration Students Sponsors Media Submission Tutorials Workshops Travel Info Proceedings

Poster Papers

Track: XML

Paper Title:
Adaptive Record Extraction From Web Pages

Authors:

  • Justin Park (University of Calgary)
  • Denilson Barbosa (University of Calgary)

Abstract:
We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns. We compare our method experimentally to the state-of-the-art, and show that our approach is very competitive for rigidly-structured records (such as product descriptions) and far superior for loosely-structured records. (such as entries on blogs).

PDF version





















sponsors