WWW2007 Paper Details
Track:
Data Mining
Paper Title:
Detecting Near-Duplicates for Web Crawling
Authors:
  • Gurmeet Manku (Google Inc.)
  • Arvind Jain (Google Inc.)
  • Anish Das Sarma (Stanford University)
Abstract:
Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other in a very small portion. The differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.

In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Slot:
Alberta, Friday, May 11, 2007, 10:30am to 12 noon.
Full-text:
PDF version