Finding Related Communities in the Web
Masashi Toyoda and Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo
IntroductionWe propose a new web search technique, which finds related communities from a given URL. A community is a set of web pages written by authors who have a common interest on a specific topic, such as fan pages of a professional baseball team. Our technique finds a community that includes a given URL, and communities on related topics, using hyperlink analysis. For example, when a fan page of a baseball team is given, our technique finds a fan community of the team, and a community of official homepages of baseball teams.
Our technique allows the user to perform new type of navigation through the web. It provides additional ways not only to related pages, but also to related communities.
Related WorkThere are some previous work that find related pages from a given URL. Kleinberg proposed the HITS algorithm , which extracts strongly connected web pages from search results of altavista and their neighbors. He also showed that it can be used for finding related pages to a given URL. Dean proposed the Companion and the Cocitation algorithm  that are based on the same concept of the HITS. We have extended these algorithms to find not only related pages but also related communities.
Kumar performed trawling  on a huge snapshot of the web, and found over 100,000 communities. The trawling puts emphasis on locating communities on a huge data set, and does not handle relations between communities. On the other hand, our purpose is finding related communities near the given URL.
AlgorithmOur algorithm uses a modified version of the HITS , which calculates the top N related pages to a given URL. It first creates subgraph of the web by following hyperlinks twice, from a given URL, in both forward and backward directions. Then it calculates hub and authority scores of each nodes on the subgraph, and returns top N authority pages. Our modified HITS is similar to the Companion algorithm , but have differences in creating subgraph from the given URL and in putting weights on edges.
The main idea of our algorithm is applying the modified HITS on each URL in the top N URLs, then clustering these results. The process is as follows.
- Apply the modified HITS on a given URL, and get a list (L) of the top N URLs.
- Apply the modified HITS on each URL in L, and collect N result lists (L1,,, LN).
- Create clusters of the result lists (L1,,, LN). In our current implementation, we put two lists into a cluster, when the lists share at least M URLs.
- Output each cluster, removing duplicate URLs in its lists.
In this algorithm, it is important to select appropriate parameters for the modified HITS and clustering. We omit details of parameter tuning in this abstract because of the limit in space.
ExperimentsFor experiments, we built our own connectivity server on Japanese web pages. Using a web crawler, we collected 17 million web pages in Japanese from July to September, 1999. Then, we mapped connectivity information of those pages on main memory, so that our search engine can access directly. We implemented the connectivity server and our algorithm on Sun Enterprise Server 6500 with 8 CPU and 4GB memory. The current implementation returns a result in 5-10 seconds.
From our experiments, we found that this algorithm first finds a community near the given URL, then finds communities on broader topics. For example, when a fan page of a baseball team is given, the result includes a fan community of the team, and a community of official homepages of baseball teams. The following is a result when a fan page of SONY VAIO computers is given.
In this case, the top 10 result of the process (1) includes the VAIO official page, and this page becomes a seed for the community of computer makers. This phenomenon happens in many cases.
- Input: http://www.sonyvaio.com/
- Community 1 (Fan pages of SONY VAIO computers):
- Community 2 (Homepages of computer makers):
SummaryWe have proposed a new web search technique, which finds related communities from a given URL. Our algorithm first finds a community near the URL, then finds broader communities. We are now trying to find smaller communities from a broader community, and construct navigation ways in both directions.
- Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
- Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide Web. Proceedings of the 8th World-Wide Web Conference, 1999.
- Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins. Trawling the Web for emerging cyber-communities. Proceedings of the 8th World-Wide Web Conference, 1999.