WWW2007 Poster Details
Poster Title:
GigaHash: Scalable Minimal Perfect Hashing for Billions of URLs
Authors:
  • Kumar Chellapilla (Microsoft Live Labs)
  • Anton Mityagin (Microsoft Live Labs)
  • Denis Charles (Microsoft Live Labs)
Abstract:
A minimal perfect function maps a static set of n keys on to the range of integers {0,1,2,...,n-1}. We present a scalable high performance algorithm based on random graphs for constructing minimal perfect hash functions (MPHFs). For a set of n keys, our algorithm outputs a description of h in expected time O(n). The evaluation of h(x) requires three memory accesses for any key x and the description of h takes up 0.89n bytes (7.13n bits). This is the best (most space efficient) known result to date. Our approach reduces the space requirement to 43% of the well known previous minimal perfect hashing (MPH) scheme due to Czech, Havas and Majewski (1992) and to 77% of a more recent algorithm due to Botelho, Kohoyakawa, and Ziviani (2005). Using a simple heuristic and Huffman coding, the space requirement is further reduced to 0.79n bytes (6.86n bits). We present a high performance architecture that is easy to parallelize and scales well to very large data sets encountered in internet search applications. Experimental results on a one billion Url URL dataset obtained from Live Search crawl data, show that the proposed algorithm (a) finds an MPHF for one billion URL Urls in less than 4 minutes, and (b) requires only 6.86 bits/key for the description of h.
Full-text:
PDF version