Track: Data Mining
Detecting Near-Duplicates for Web Crawling
- Gurmeet Manku (Google Inc.)
- Arvind Jain (Google Inc.)
- Anish Das Sarma (Stanford University)
Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other in a very small portion. The differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.
In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.