Track: E* Applications
Extraction and Search of Chemical Formulae in Text Documents on the Web
- Bingjun Sun (Pennsylvania State University)
- Qingzhao Tan (Pennsylvania State University)
- Prasenjit Mitra (Pennsylvania State University)
- C. Lee Giles (Pennsylvania State University)
Scientists often search the Web for articles related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets back only articles that contain the exact keyword string expressing that formula. Exact keyword matching causes two problems in this domain: a) if the scientist searches for CH4 and the article has H4C, the article is not returned, and b) ambiguous searches like "He" return documents where the pronoun "he" occurs as well as documents where helium is mentioned. To remedy these deficiencies, we propose a chemical formula search engine. Building it requires solving the following problems: (1) extracting chemical formulae from text documents, (2) indexing chemical formulae, and (3) designing a ranking function for articles in which the chemical formulae occur. We introduce query models for formula search and, for each, propose a scoring scheme based on features of partial formulae to measure the relevance of chemical formulae to queries. We evaluate algorithms for identifying chemical formulae in documents using a classification method based on Support Vector Machines (SVM) and a probabilistic model based on conditional random fields (CRF). We propose different methods for SVM and CRF to tune the trade-off between recall and precision on imbalanced data and thereby improve overall performance. A feature selection method based on frequency and discrimination removes uninformative and redundant features. Experiments show that our approaches to chemical formula extraction work well, especially after trade-off tuning. The results also demonstrate that feature selection can reduce the index size without changing the ranked query results much.
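The two matching problems named in the abstract (CH4 versus H4C, and "He" versus "he") can be illustrated with a small sketch. This is not the authors' method, which relies on SVM and CRF models for extraction and on partial-formula features for ranking. It is only a minimal illustration, assuming a formula is a flat sequence of element symbols with optional counts. The names `parse_formula` and `canonical_key` are hypothetical, and the parser does not check symbols against the periodic table.

```python
import re
from collections import Counter

# Assumption: an element symbol is one uppercase letter plus an optional
# lowercase letter, followed by an optional integer count. Parentheses,
# charges, and hydrates are deliberately out of scope for this sketch.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Return element counts for a flat formula such as 'CH4' or 'H4C'."""
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # A gap means some characters did not fit the token pattern.
            raise ValueError(f"not a flat chemical formula: {formula!r}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        pos = match.end()
    if pos != len(formula) or not counts:
        raise ValueError(f"not a flat chemical formula: {formula!r}")
    return counts


def canonical_key(formula: str) -> str:
    """Order-independent key: elements sorted by symbol, counts explicit."""
    counts = parse_formula(formula)
    return "".join(f"{symbol}{n}" for symbol, n in sorted(counts.items()))


def is_candidate(token: str) -> bool:
    """Case-sensitive check that separates 'He' from the pronoun 'he'."""
    try:
        parse_formula(token)
        return True
    except ValueError:
        return False
```

Under these assumptions, `canonical_key("CH4")` and `canonical_key("H4C")` produce the same key, so an index built on the key would return the H4C article for a CH4 query, and `is_candidate("he")` is false because the token does not start with an uppercase letter. A capitalized "He" at the start of a sentence would still pass this check, which is the kind of context-dependent ambiguity that motivates learned extraction models.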