
Vector Space Model (VSM)

The vector space model has been widely used in the traditional IR field [11,12]. Most search engines also use similarity measures based on this model to rank Web documents. The model creates a space in which both documents and queries are represented by vectors. For a fixed collection of documents, an $m$-dimensional vector is generated for each document and each query from sets of terms with associated weights, where $m$ is the number of unique terms in the document collection. Then, a vector similarity function, such as the inner product, can be used to compute the similarity between a document and a query.

In VSM, the weight associated with a term is calculated from two numbers: the term frequency $f_{ij}$, the number of occurrences of term $y_j$ in document $x_i$; and the document frequency $d_j$, the number of documents in the collection that contain term $y_j$.

The similarity $sim_{vs}(q,x_i)$, between a query $q$ and a document $x_i$, can be defined as the inner product of the query vector $Q$ and the document vector $X_i$:
 
\begin{displaymath}sim_{vs}(q,x_i)=Q \cdot X_i = \frac{\sum_{j=1}^{m} v_j \cdot w_{ij}}{\sqrt{\sum_{j=1}^{m} (v_j)^2 \cdot \sum_{j=1}^{m} (w_{ij})^2}}\end{displaymath} (5)
where $m$ is the number of unique terms in the document collection. The document weight $w_{ij}$ and the query weight $v_j$ are
\begin{displaymath}w_{ij} = f_{ij} \cdot \log(N/d_j) \qquad \mbox{and}\end{displaymath}
 
\begin{displaymath}v_j = \left \{ \begin{array}{ll}\log (N/d_j) & \mbox{if $y_j$ appears in $q$} \\ 0 & \mbox{otherwise}\end{array} \right .\end{displaymath} (6)
where $f_{ij}$ is the frequency of term $y_j$ in document $x_i$, $N$ is the total number of documents in the collection, and $d_j$ is the number of documents containing term $y_j$.
Cosine normalization is applied to the similarity weight, even though the queries in our experiments are short.

Due to the dynamic nature of the Web and the inability to access the whole Web, the VSM method cannot be applied directly in evaluating the precision of search engines [20]. In Equation (5), the inverse document frequency $\log(N/d_j)$ cannot be computed because $N$ and $d_j$ are often unknown for documents on the Web. There are three ways to deal with this problem: (1) making simple assumptions, e.g., that the inverse document frequency is a constant for all documents [22]; (2) estimating the parameter values by sampling the Web [1]; and (3) using information from search engines and/or experts' estimates. In this paper, we use a combination of the second and third approaches.

To apply VSM to the Web, we treat the Web as one large database and estimate the parameter values of VSM. To estimate $N$, we use the coverage of Google, the largest search engine, which indexed about 1 billion pages as of August 2001 [24]. Because Google also uses link data, its actual coverage exceeds 1.3 billion pages; we therefore set $N = 1,387,000,000$, the number of pages covered by Google. To estimate $d_j$, the number of documents containing a given term $y_j$, we use information extracted from several search engines as follows:

  1. Submit the term $y_j$ to several search engines (we use five search engines in our experiments). Each search engine $i$ returns $\alpha_i$, the number of documents containing the term.
  2. Calculate the normalized value $\gamma_i$ for each search engine $i$, based on $\alpha_i$ and the relative size ($\beta_i$) of the search engine to that of the whole Web: $\gamma_i = \alpha_i / \beta_i$. The sizes of the five search engines, AltaVista, Fast, Google, HotBot, and NorthernLight, used in our experiments are 550, 625, 1000, 500, and 350 million documents, respectively, as reported in previous studies [24]. Since the total size of the Web is estimated to be about 1.387 billion documents, the relative sizes ($\beta_i$) of these search engines are 0.397, 0.451, 0.721, 0.360, and 0.252, respectively.
  3. Take the median of the normalized values of the search engines as the final result.
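The three-step estimate of $d_j$ can be sketched as below. The relative sizes $\beta_i$ are taken from the text; the per-engine counts $\alpha_i$ are hypothetical stand-ins for what step 1 would retrieve from the engines.

```python
import statistics

# Relative sizes beta_i of the five engines (AltaVista, Fast, Google,
# HotBot, NorthernLight), i.e., engine size / 1.387 billion (from the text).
BETAS = [0.397, 0.451, 0.721, 0.360, 0.252]

def estimate_dj(alphas, betas=BETAS):
    """Normalize each engine's reported count alpha_i by its relative
    size beta_i (step 2), then take the median (step 3)."""
    normalized = [a / b for a, b in zip(alphas, betas)]
    return statistics.median(normalized)

# Hypothetical counts alpha_i returned by the five engines for a term y_j
# (step 1 would submit the term to each engine and read these off).
alphas = [2_000_000, 2_300_000, 3_500_000, 1_700_000, 1_300_000]
d_j = estimate_dj(alphas)
```

The median is used rather than the mean, so a single engine with an anomalous count or size estimate does not skew the final $d_j$.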
After obtaining the values of $N$ and $d_j$, $sim_{vs}$ can be computed easily.


2002-02-18