WWW2006 - Random Sampling from a Search Engine's Index
| Skip to main content | Skip to navigation |

Register Now!

Random Sampling from a Search Engine's Index

  • Ziv Bar-Yossef, Technion - Israel Institute of Technology, Israel
  • Maxim Gurevich, Department of Electrical Engineering, Technion, Israel
Winner of Best Paper Award

Full text:

Presentation Slides:

Track: Search

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from two well recorded biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh data about the relative sizes of Google, MSN Search, and Yahoo!.

Citation

Bar-Yossef, Z. and Gurevich, M. 2006. Random sampling from a search engine's index. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 367-376.
DOI= http://doi.acm.org/10.1145/1135777.1135833

Other items being presented by these speakers

  • Do not Crawl in the DUST: Different URLs with Similar Text (Posters Track)

Organised by

ECS Logo

in association with

BCS Logo ACM Logo

Platinum Sponsors

Sponsor of The CIO Dinner


Become a sponsor or exhibitor
Valid XHTML 1.0! IFIP logo WWW Conference Committee logo Web Consortium logo Valid CSS!