# Determining relevant pages

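Given a query $q$, an n-word document $a = \{w_1, w_2, \ldots, w_n\}$, and a set of n recognized words, one can represent $q$ and $a$ each as a vector of word frequencies, $\vec{q}$ and $\vec{a}$. A common measure of similarity between two word frequency vectors $\vec{q}$ and $\vec{a}$, weighted by inverse document frequency (idf), is the cosine distance between them:

$$\mathrm{score}(\vec{q}, \vec{a}) = \frac{\sum_{w} \mathrm{idf}(w)^2 \, f_q(w) \, f_a(w)}{\sqrt{\sum_{w} \left(\mathrm{idf}(w) \, f_q(w)\right)^2} \; \sqrt{\sum_{w} \left(\mathrm{idf}(w) \, f_a(w)\right)^2}},$$

where the sums run over the recognized words, $f_d(w)$ is the number of times word $w$ appears in the document $d$, and $\mathrm{idf}(w)$ is the inverse document frequency of the word $w$, defined as:

$$\mathrm{idf}(w) = \log \frac{|D|}{|\{d \in D : f_d(w) > 0\}|},$$

where $D$ is the document set in consideration.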
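To make the score concrete, here is a minimal Python sketch of an idf-weighted cosine between two word-frequency vectors. It is an illustration under stated assumptions, not the author's implementation: documents are taken to be already tokenized lists of words, the function names `idf` and `score` and the toy collection `docs` are hypothetical, and a word that appears in no document of the set is assumed to get weight 0.

```python
import math
from collections import Counter


def idf(word, documents):
    """Inverse document frequency: log(|D| / number of documents containing word)."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # assumption: words absent from the document set carry no weight
    return math.log(len(documents) / containing)


def score(query, doc, documents):
    """idf-weighted cosine between the word-frequency vectors of query and doc."""
    f_q, f_a = Counter(query), Counter(doc)  # raw word counts f_d(w)
    weights = {w: idf(w, documents) for w in set(f_q) | set(f_a)}
    dot = sum(f_q[w] * f_a[w] * weights[w] ** 2 for w in weights)
    norm_q = math.sqrt(sum((f_q[w] * weights[w]) ** 2 for w in f_q))
    norm_a = math.sqrt(sum((f_a[w] * weights[w]) ** 2 for w in f_a))
    if norm_q == 0 or norm_a == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (norm_q * norm_a)


# Hypothetical toy document set, used only for illustration.
docs = [
    ["web", "crawler", "index"],
    ["web", "page", "rank"],
    ["cooking", "recipes"],
]
query = ["page", "rank"]
ranked = sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
```

With this toy set, `ranked[0]` is the second document, the only one sharing words with the query. Note that a word occurring in every document has idf 0 and therefore does not contribute to the score at all.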

Sandeep Pandey 2003-03-05