Refereed Papers

Track: WWW in China: Mining the Chinese Web

Can Chinese Web Pages be Classified with English Data Source?


  • Xiao Ling (Shanghai Jiao Tong University)
  • Gui-Rong Xue (Shanghai Jiao Tong University)
  • Wenyuan Dai (Shanghai Jiao Tong University)
  • Yun Jiang (Shanghai Jiao Tong University)
  • Qiang Yang (Hong Kong University of Science and Technology)
  • Yong Yu (Shanghai Jiao Tong University)

As the World Wide Web in China grows rapidly, mining knowledge from Chinese Web pages is becoming increasingly important. Mining Web information usually relies on machine learning techniques that require a large amount of labeled data to train credible models. Although the number of Chinese Web pages is increasing quickly, labeled Chinese data remain scarce. Labeled English Web pages, by contrast, are comparatively plentiful. These labeled data, though expressed in a different language, share a substantial amount of semantic information with Chinese pages and can be used to help classify them. In this paper, we propose an information bottleneck based approach to this cross-language classification problem. Our algorithm first translates all the Chinese Web pages into English. Then all the Web pages, both the translated Chinese ones and the English ones, are encoded through an information bottleneck that allows only a limited amount of information to pass. To retain as much useful information as possible under this constraint, the part common to the Chinese and English Web pages tends to be encoded into the same code (i.e., class label), which makes the cross-language classification accurate. We evaluated our approach on Web pages collected from the Open Directory Project (ODP). The experimental results show that our method significantly outperforms several existing supervised and semi-supervised classifiers.
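
The encoding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the Chinese pages have already been translated and tokenized, gives every document equal weight, and uses a simple hard-assignment rule that places each translated page in the class whose word distribution is closest in KL divergence (a common way to reduce the information lost through a bottleneck). The names word_dist, kl, class_dist and classify, and the smoothing parameter alpha, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def word_dist(tokens, vocab, alpha=0.1):
    """Smoothed word distribution p(word | document) over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl(p, q):
    """KL divergence D(p || q); smoothing keeps every q[w] positive."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def class_dist(docs, vocab):
    """Word distribution of a class: unweighted mean of its members
    (assumption: all documents carry equal weight)."""
    return {w: sum(d[w] for d in docs) / len(docs) for w in vocab}


def classify(english, translated, n_iter=5):
    """english: list of (tokens, label); translated: list of token lists.

    Labeled English pages stay fixed in their classes. Each translated page
    is repeatedly reassigned to the class with the smallest KL divergence
    until the assignment stops changing or n_iter is reached.
    """
    vocab = sorted({w for toks, _ in english for w in toks}
                   | {w for toks in translated for w in toks})
    en = [(word_dist(toks, vocab), label) for toks, label in english]
    zh = [word_dist(toks, vocab) for toks in translated]
    labels = sorted({label for _, label in en})
    assign = [None] * len(zh)
    for _ in range(n_iter):
        # Rebuild each class from its English pages plus assigned translations.
        members = {c: [d for d, label in en if label == c] for c in labels}
        for d, c in zip(zh, assign):
            if c is not None:
                members[c].append(d)
        centroids = {c: class_dist(members[c], vocab) for c in labels}
        new = [min(labels, key=lambda c: kl(d, centroids[c])) for d in zh]
        if new == assign:
            break
        assign = new
    return assign
```

Calling classify with a list of labeled English token lists and a list of translated token lists returns one predicted label per translated page, in input order.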

Inquiries can be sent to: program-chairs at www2008.org

