Extraction as tree rewriting next up previous
Next: Extraction of lists as Up: Extraction as classification Previous: Extraction as classification

Extraction as tree rewriting

In earlier work [5] we described a special-purpose language for writing wrappers. A program in this language manipulates an HTML parse tree (that is, a tree with nodes labeled by tag names like body, table, and ul), primarily by deleting and relabeling nodes. A wrapper in this language converts the parse tree for a Web page into another tree labeled with terms from the knowledge base, which can be stored directly in the knowledge base. This paradigm for extraction is clearly not sufficient for all purposes, since it is impossible to extract along boundaries not indicated by HTML markup commands; for instance, it is impossible to separate names from affiliations in the HTML page shown on the left-hand side of Figure [*]. In practise, however, the language is almost always expressive enough to wrap the inputs of WHIRL.1

As a first step in understanding the problems involved in automatically learning wrappers, we collected 111 different wrapper programs, all written as part of the WHIRL project, but written over several months in several different domains. Each of these wrappers was paired with a single sample page that it correctly wrapped. This was as complete a sample of working conversion programs as we could assemble; the only existing wrappers that were discarded were ones for which no correctly-wrapped HTML pages were available, due to changes in the associated Web sites. Of these 111 wrappers, 84 (or nearly 75%) fell into two special classes, which we call simple lists and simple hotlists.


  
Figure: A simple list, a simple hotlist, and the data that would be extracted from each.



						A Simple List

 
HTML Source:
<html><head>...</head>
<body>
<h1>Editorial Board Members</h1>
<table> <tr>
<td>G. R. Emlin, Lucent</td>
<td>Harry Q. Bovik, Cranberry U</td>
</tr> <tr>
<td>Bat Gangley, UC/Bovine</td>
<td>Pheobe L. Mind, Lough Tech</td>
</tr> <tr>
...
</table>
...
 
Extracted data:
G. R. Emlin, Lucent
Harry Q. Bovik, Cranberry U
Bat Gangly, UC/Bovine
...






						A Simple Hotlist

 
HTML Source:
<html><head>...</head>
<body>
<h1>My Publications</h1>
<ul>
<li>Optimization of fuzzy neural
networks using distributed parallel
case-based genetic knowledge discovery
(<a href=``buzz.ps''>postscript</a>,
 <a href=``buzz.pdf>PDF</a>)</li>
<li>A linear-time version of GSAT
(<a href=``peqnp.ps''>postscript</a>)</li>
...
 
Extracted data:
Optimization ...(postscript,PDF) buzz.ps
Optimization ...(postscript, PDF) buzz.pdf
A linear-time version of ... peqnp.ps
... ...




In a page containing a simple list, the structure extracted is a one-column relation containing a set of strings s1,...,sN, and each si is all the text that falls below some node ni in the parse tree. In a simple hotlist, the extracted structure is a two-column relation, containing a set of pairs <s1,u1>,...,<sN,uN>; each si is all the text that falls below some node ni in the parse tree; and each ui is a URL that is associated with some HTML anchor element aithat appears somewhere inside ni. Figure [*] shows the HTML source for a simple list and a simple hotlist, together with the data that is extracted from each. Notice that, although we use the term ``list'', it is not important that the information is presented in a format that looks like a list to the user; for instance, our example for a ``simple list'' is formatted as a table with two columns.

Henceforth we will concern ourselves only with the problem of extracting simple lists and simple hotlists. We will evaluate all methods on the collection of 84 benchmark problems described above. In the benchmarks, we ignore certain operations performed by the wrappers. Some wrappers included filter predicates, which allow them to ignore certain table entries; for instance, a wrapper might extract a pair <s,u> only if the URL u contains the substring ``.ps''. These filter predicates were removed, since this sort of filtering can be performed just as easily after extraction using mechanisms in the knowledge base. A few wrappers were also associated with sed scripts, which are applied (by the interpreter for the tree-rewriting language) to a Web page before parsing.2 Additional preprocessing steps are applied to every Web page; for example, relative URLs are always translated to absolute ones. We simplified the problem by assuming that all preprocessing steps are known, including any page-specific sed scripts--i.e., we paired each wrapper with a preprocessed Web page input.


next up previous
Next: Extraction of lists as Up: Extraction as classification Previous: Extraction as classification
Wei Fan
1999-03-04