Determining relevant pages

Given a text q, an n-word document a = {w1, w2, ..., wn}, and a fixed set of recognized words, one can represent q and a each as a vector of word frequencies, $ \vec{q}\,$ and $ \vec{a}\,$, with one component per recognized word. A common measure of similarity between two word frequency vectors $ \vec{a}\,$ and $ \vec{q}\,$, weighted by inverse document frequency (idf), is the cosine of the angle between them (often loosely called the cosine distance):

$\displaystyle \mathrm{score}(\mathbf{q}, \mathbf{a}) = \frac{\sum_{w \in q,a} \lambda_{w}^{2} \cdot f_{q}(w) \cdot f_{a}(w)}{\sqrt{\sum_{w \in q} (\lambda_{w} f_{q}(w))^{2} \cdot \sum_{w \in a} (\lambda_{w} f_{a}(w))^{2}}}$ ,
where $f_{d}(w)$ is the number of times word w appears in the document d and $\lambda_{w}$ is the inverse document frequency of the word w, defined as:
$\displaystyle \lambda_{w} = \log\left(\frac{\vert\mathcal{D}\vert}{\vert \{ d \in \mathcal{D} : f_{d}(w) > 0 \} \vert}\right)$
where $\mathcal{D}$ is the document set under consideration.
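
The score and idf definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation. The function names `idf` and `score` are hypothetical, documents are assumed to be plain lists of already-tokenized words, and a word that appears in no document of the set is assumed to get weight 0 (the formula itself leaves that case undefined).

```python
import math
from collections import Counter


def idf(word, docs):
    """lambda_w = log(|D| / |{d in D : f_d(w) > 0}|).

    Assumption: a word found in no document gets weight 0.0,
    since the ratio is undefined in that case.
    """
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0


def score(q, a, docs):
    """idf-weighted cosine similarity between word lists q and a."""
    fq, fa = Counter(q), Counter(a)  # f_q(w) and f_a(w)
    doc_sets = [set(d) for d in docs]  # faster membership tests
    lam = {w: idf(w, doc_sets) for w in set(fq) | set(fa)}

    # Numerator: sum over words that occur in both q and a.
    num = sum(lam[w] ** 2 * fq[w] * fa[w] for w in set(fq) & set(fa))

    # Denominator: product of the two weighted vector norms.
    norm_q = sum((lam[w] * fq[w]) ** 2 for w in fq)
    norm_a = sum((lam[w] * fa[w]) ** 2 for w in fa)
    denom = math.sqrt(norm_q * norm_a)

    # Assumption: return 0.0 when either weighted vector is all zeros.
    return num / denom if denom else 0.0


docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["banana", "cherry", "cherry"],
]
same = score(docs[0], docs[0], docs)      # identical documents
partial = score(docs[0], docs[1], docs)   # one shared word out of two
disjoint = score(["apple"], ["cherry"], docs)  # no shared words
```

Note that a word occurring in every document of the set has $\lambda_{w} = \log 1 = 0$ and therefore contributes nothing to the score, which is the intended effect of the idf weighting.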



Sandeep Pandey 2003-03-05