WWW2007: Program

Poster Papers

Track: XML

Paper Title:
U-Rest: An Unsupervised Record Extraction SysTem


  • Yuan Kui Shen (MIT CSAIL)
  • David Karger (MIT CSAIL)

We demonstrate a system that extracts record sets from record-list web pages with no direct human supervision. Our system, U-REST, reframes unsupervised record extraction as a two-phase machine learning problem: a clustering phase, in which structurally similar regions are discovered, and a record cluster detection phase, in which the discovered groupings of regions are ranked by their likelihood of being records. This framework simplifies the record extraction task and allows for independent analysis of the algorithms and the underlying features. In our work, we survey a large set of features under this simplified framework. We conclude with a preliminary comparison of U-REST against similar systems and show improvements in extraction accuracy.
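The two-phase framing can be illustrated with a toy sketch. This is not U-REST's implementation: the region representation (a `tag_path` string plus `text`), the exact-match grouping used as a stand-in for clustering, and the scoring heuristic in `record_score` are all assumptions made for illustration, and the abstract does not specify the actual algorithms or features.

```python
from collections import defaultdict


def cluster_regions(regions):
    """Phase 1 (assumed stand-in): group regions with an identical tag path."""
    clusters = defaultdict(list)
    for region in regions:
        clusters[region["tag_path"]].append(region)
    return list(clusters.values())


def record_score(cluster):
    """Phase 2 (hypothetical heuristic): favor repeated regions that carry text."""
    if len(cluster) < 2:
        # A region that never repeats is unlikely to be part of a record list.
        return 0.0
    avg_text_len = sum(len(r["text"]) for r in cluster) / len(cluster)
    return len(cluster) * avg_text_len


def rank_clusters(regions):
    """Run both phases and return clusters ordered from most to least record-like."""
    return sorted(cluster_regions(regions), key=record_score, reverse=True)


# Hypothetical page regions, written by hand for this example.
regions = [
    {"tag_path": "html/body/ul/li", "text": "Camera A, $199"},
    {"tag_path": "html/body/ul/li", "text": "Camera B, $249"},
    {"tag_path": "html/body/ul/li", "text": "Camera C, $299"},
    {"tag_path": "html/body/div/a", "text": "Home"},
    {"tag_path": "html/body/div/a", "text": "Help"},
    {"tag_path": "html/body/h1", "text": "Digital cameras on sale"},
]

best = rank_clusters(regions)[0]
```

Because the two phases are separate functions, either one can be swapped out (for example, a different similarity measure in phase 1 or a learned ranker in phase 2) without touching the other, which mirrors the independent analysis the abstract describes.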
