WWW2007: Program

Poster Papers

Track: Search

Paper Title:
EPCI: Extracting Potentially Copyright Infringement Texts from the Web

Authors:

  • Takashi Tashiro (Waseda University)
  • Takanori Ueda (Waseda University)
  • Taisuke Hori (Waseda University)
  • Yu Hirate (Waseda University)
  • Hayato Yamana (Waseda University)

Abstract:
In this paper, we propose EPCI, a new system for extracting potentially copyright-infringing texts from the Web. EPCI works as follows: (1) it generates a set of queries from a given copyrighted seed-text; (2) it submits every query to search engine APIs; (3) it gathers the search result Web pages in rank order, starting from the top, until the similarity between the seed-text and the result pages falls below a given threshold; and (4) it merges all the gathered pages and re-ranks them in order of their similarity to the seed-text. Our experiments with 40 seed-texts show that EPCI extracts, on average, 132 potentially copyright-infringing Web pages per seed-text with 94% precision.
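
The four steps above can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not say how queries are built or how similarity is measured, so the fixed-length quoted phrases, the word-trigram containment score, the default threshold, and the `search` callable (a stand-in for a search engine API that returns ranked `(url, text)` pairs) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

Page = Tuple[str, str]  # (url, page text); hypothetical result shape


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(seed: str, page: str, n: int = 3) -> float:
    """Assumed measure: fraction of the seed's n-grams that occur in the page."""
    seed_shingles = shingles(seed, n)
    if not seed_shingles:
        return 0.0
    return len(seed_shingles & shingles(page, n)) / len(seed_shingles)


def generate_queries(seed: str, length: int = 5, stride: int = 5) -> List[str]:
    """Step 1 (assumed strategy): quoted phrases of `length` words from the seed."""
    words = seed.split()
    if not words:
        return []
    if len(words) <= length:
        return ['"' + " ".join(words) + '"']
    return ['"' + " ".join(words[i:i + length]) + '"'
            for i in range(0, len(words) - length + 1, stride)]


def epci_sketch(seed: str,
                search: Callable[[str], List[Page]],
                threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Steps 2-4: query, gather until similarity drops, merge, and re-rank."""
    best_score: Dict[str, float] = {}
    for query in generate_queries(seed):
        # Step 2: `search` is a hypothetical wrapper around a search engine API.
        for url, text in search(query):
            score = similarity(seed, text)
            # Step 3: walk results from the top and stop at the first page
            # whose similarity falls below the threshold.
            if score < threshold:
                break
            # Step 4a: merge across queries, keeping one entry per URL.
            best_score[url] = max(score, best_score.get(url, 0.0))
    # Step 4b: re-rank the merged pages, most similar first.
    return sorted(best_score.items(), key=lambda item: item[1], reverse=True)
```

Passing the search function in as a parameter keeps the sketch independent of any particular search engine and makes it easy to try with canned results.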
