Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR)
Three-Level Scoring Method (TLS)The TLS method is developed to take human expectation of search results into account and is modeled after existing manual approaches. Many existing manual methods use the following criteria to assign relevance scores [8,9,17]:
- Relevant links are links that are related to information needs of a query or have many links to other Web pages which may be useful to the query. They get score 2.
- Slightly relevant links are the pages that are partially related to a query. They contain too short a definition to be useful, or contain the introduction of company products that have some technical solutions to a problem relevant to the query, or contain some description to a related textbook. They are given score 1.
- Duplicate links are the pages that appear in the returned links with the same URL more than one time, and the links that differ in that one has index.htm or index.html as the suffix and the other does not. They are given score 0. Mirror sites are not counted as duplicates.
- Inactive links are the links that give error messages like file not found (404), forbidden, or server not responding (603) errors. These links are given score 0.
- Irrelevant links are the links containing irrelevant information to the query. They get score 0.
- Given a query phrase with terms and a Web page , a raw score is calculated as:
- Convert the raw score to a three-level relevance score through thresholding, with value 2 for relevant, 1 for partially relevant, and 0 for irrelevant:
Let's use a simple example to illustrate how the method works. For query ``distributed computing systems'', there are 1 phrase with 3 terms (distributed computing systems), 3 sub-phrases with 2 terms (distributed computing, computing systems and distributed systems), and 3 sub-phrases with 1 term (distributed, computing and systems). Assume that in a given Web page, the number of occurrence of the exact phrase is , the total number of occurrence of the sub-phrases with two terms is , and the total number of occurrence of the sub-phrases with one term is . If the three parameters in the algorithm are set as , , and , then we get , and . In the previous work, it is observed that performance evaluation results were not very sensitive to the parameter values of TLS and the default setting, , , and , generally worked well . Thus, we used the default setting in the experiments.
Next:ExperimentsUp:Combining the HITS-based AlgorithmsPrevious:Cover Density Ranking (CDR) 2002-02-18