(1.5) First I will briefly explain the motivation behind this work. Most web publications are written in HTML. Tables are commonly used in such documents to present relational data. Since they are inherently information-rich, table understanding has many important applications on the web. Although most tables in HTML documents are marked by the table tag, the inverse is not true: not all table elements indicate true relational tables. In fact, most of the time the table tag is used simply to achieve a multi-column layout effect. So the first step in table understanding on the web is table detection. In this study we concentrate on tables that are coded using the table tag, so by table detection we refer to the process of determining whether a given table element represents a genuine table.
(0.5) Obviously, the first question we need to answer before we can proceed is: what is a genuine table? Here is our working definition: ... In other words, the proper interpretation of a cell in a genuine table requires reference to either row or column headings. In contrast, a table element that does not have this property is called a non-genuine table.
To give you an example ... whereas this one actually represents a list. The job of table detection is essentially to separate genuine tables like these from non-genuine ones like this.
(1.5) There has been a limited amount of previous work on this subject. However, most previous approaches are based on heuristics, and the algorithms were tested on either a very small database containing up to hundreds of samples, or a domain-specific database such as ... The main goal of our study was to design a more generic and scalable table detection algorithm, so we decided to explore machine learning based algorithms that are automatically trainable and thus extendable. The algorithms were trained and tested on a large database containing over 10,000 samples.
(0.5) Here are what we see as the main contributions of our work. First, we designed a novel set of features to capture both layout and content characteristics of genuine tables. Finally, we tested two popular classification schemes for their effectiveness in this particular application. Next I am going to explain each of these components in detail, starting with the features.
(1.5) Before feature extraction, each HTML document has to be pre-processed. There are three main steps. First, the page is parsed to obtain the document tree, which describes the HTML structure of the page. We then extract leaf table elements from the tree. We decided to concentrate on leaf table elements because, from our observations, almost all genuine tables are represented by leaf table elements. Finally, each table element is analyzed in a pseudo-rendering process to obtain accurate row/column counts. This is necessary because simply counting tr and td tags is not sufficient for this purpose; many other tags can affect the layout of a table element, including ... The process is similar to what a browser does to configure the layout of a table element, and it gives us a more accurate description of the physical structure of the table.
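The row/column counting part of this step can be sketched as follows. This is only a minimal illustration of span-aware counting, not the actual pseudo-rendering module (which also handles the other layout-affecting tags mentioned above); the cell representation used here is an assumption made for the sketch.

```python
def pseudo_render(rows):
    """rows: one list per <tr>, each a list of (rowspan, colspan) pairs
    for the cells in that row.  Returns (row_count, column_count) of the
    rendered grid, accounting for cells that span multiple rows/columns."""
    occupied = set()  # (row, col) grid slots filled so far
    for r, row in enumerate(rows):
        c = 0
        for rowspan, colspan in row:
            while (r, c) in occupied:      # skip slots claimed by rowspans above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    if not occupied:
        return 0, 0
    return (max(rr for rr, _ in occupied) + 1,
            max(cc for _, cc in occupied) + 1)
```

For instance, a table with two tr tags whose first cell carries rowspan=2 still renders as a 2x2 grid, which naive tr/td counting would miss.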
(1) Here is a summary of our feature set. There are three groups of features. The first group is extracted entirely from the physical structure of the table. The second group looks at the content type of the cells in the table, the simplest kind of content analysis. And finally, the text content feature is designed to exploit the textual information within a table, looking at it from a text classification angle. Next I am going to explain each group of features in detail.
(1) The first group is the layout features. The first few features in this group are rather straightforward ... The last feature, cumulative length consistency, or CLC, is a more complex one. It belongs to a type of feature called consistency features, which we think play a very important role in table detection. The observation behind this is that in a genuine table one can usually detect a certain kind of consistency, in terms of physical layout, content type, syntax, or semantics, found along either the row dimension or the column dimension. Such consistency is usually much weaker, or does not exist, in a non-genuine table. This kind of feature was explored in previous work in a very narrow way: for example, some researchers looked for rows or columns that consist entirely of dates, or numbers. We attempted to explore this feature type more systematically, in a broader sense, and CLC is an instance of this feature type.
(2) Here is how it is calculated. We first compute the within-row CLC, denoted CLC_r. For each row, we compute the mean cell length, m. Then for each cell in the row, we compute a length consistency measure defined by this formula, where L(c) is the length of this particular cell. It results in a number between -0.5 and 0.5, with -0.5 indicating extreme inconsistency and 0.5 indicating extreme consistency. We then get a cumulative measure by adding up all the cells in the row. Since each individual consistency measure can be either negative or positive, a very consistent cell will cancel out a very inconsistent cell. As a result, if the majority of the cells in a row are consistent, the cumulative measure will be positive; otherwise it will be negative. We then compute the overall within-row CLC by averaging over all the rows. The within-column CLC is computed in the same way along the column dimension. Finally, the CLC feature is the maximum of the two. We take the maximum because consistency is usually found along only one direction, not both.
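As a concrete illustration, the computation just described might be written as follows. The per-cell measure used below, 0.5 - min(|L(c) - m| / m, 1), is an assumption: it is one simple formula with the stated range of -0.5 to 0.5 (equal to 0.5 when the cell length matches the row mean exactly, approaching -0.5 as it diverges), not necessarily the exact formula on the slide.

```python
def clc_rows(table):
    """table: list of rows, each a list of cell strings.
    Returns the within-row cumulative length consistency, averaged over rows."""
    row_scores = []
    for row in table:
        lengths = [len(cell) for cell in row]
        m = sum(lengths) / len(lengths)          # mean cell length for this row
        if m == 0:
            row_scores.append(0.0)
            continue
        # Per-cell consistency in [-0.5, 0.5]; this exact formula is an
        # assumption consistent with the range described in the talk.
        lc = [0.5 - min(abs(l - m) / m, 1.0) for l in lengths]
        row_scores.append(sum(lc))               # cumulative: signed values cancel
    return sum(row_scores) / len(row_scores)

def clc(table):
    """Overall CLC: maximum of within-row and within-column values
    (assumes a rectangular cell grid for the transpose)."""
    cols = [list(col) for col in zip(*table)]
    return max(clc_rows(table), clc_rows(cols))
```

A table whose cells all have equal length scores the maximum in both directions, while a row mixing very short and very long cells drives the within-row score negative.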
(0.5) The second feature group is the content type features. It has two components. The first is ... calculated over the whole table; this gives us 7 features. The last feature is another instance of the consistency feature type and is defined in a similar way to CLC.
(1) We have described two instances of the consistency feature type; these are the only two currently used. We think similar consistency features can also be computed for syntax or semantics, and we plan to explore that in future work.
(1) Since most tables contain a lot of text, we decided to explore the possibility of deriving a feature by treating it as a text classification problem. Text classification is a well-studied problem in IR and many algorithms have been proposed. However, our particular application has several special characteristics ... Finally, since this is only one of many features, we need a continuous score, as opposed to a binary decision, in order to combine it with the other features. Considering all this, we designed a feature based on the vector space model, which we call the word group feature.
(1.5) Its definition is quite involved, so I will try to explain it in the simplest terms to give you a flavor of it; a more accurate description can be found in our paper. First, we compose a set of words extracted from all the tables in a training set, after proper preprocessing such as stemming and stop word removal. For each word in the set, we get the following counts ... From these counts we derive two base weights for each word, one for the genuine table class and one for the non-genuine table class. Each weight is defined as the term frequency in its own class, modified by the ratio between the document frequency in its own class and the document frequency in the opposite class. Those familiar with IR will notice that this is clearly inspired by the standard tf-idf measure used in IR.
(1.5) For an incoming table, we construct its own word set. We then get the effective word set by finding the intersection of the generic word set and the new individual word set, and denote it as ..., where m_1, etc. are indexes back into the generic word set. We then compute three vectors. Vector Gs represents the genuine table class and is composed of the corresponding genuine table weights divided by a normalization factor. Vector Ns ... And vector It represents the incoming table and ... Finally, the word group feature is defined as the ratio of the dot products ... Basically, it measures whether the vector representing the incoming table correlates better with the genuine table vector or the non-genuine table vector.
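A sketch of this pipeline, under the stated assumptions, might look like the following. The exact weight formula and normalization are guesses consistent with the description (term frequency modified by a document-frequency ratio; cosine-style normalized dot products), and the names base_weights and word_group_feature are my own, not from the paper.

```python
import math
from collections import Counter

def base_weights(genuine_docs, nongenuine_docs):
    """Each arg: list of token lists (already stemmed, stop words removed).
    Returns {word: (w_genuine, w_nongenuine)}.  The weight form used here,
    tf_own * df_own / (df_other + 1), is an assumed tf-idf-like instantiation."""
    def stats(docs):
        tf, df = Counter(), Counter()
        for doc in docs:
            tf.update(doc)        # term frequency over the class
            df.update(set(doc))   # document frequency over the class
        return tf, df
    tf_g, df_g = stats(genuine_docs)
    tf_n, df_n = stats(nongenuine_docs)
    vocab = set(tf_g) | set(tf_n)
    return {w: (tf_g[w] * df_g[w] / (df_n[w] + 1),
                tf_n[w] * df_n[w] / (df_g[w] + 1)) for w in vocab}

def word_group_feature(table_tokens, weights):
    """Continuous score: > 1 means the incoming table correlates better with
    the genuine-table vector than with the non-genuine one."""
    eff = sorted(w for w in set(table_tokens) if w in weights)  # effective word set
    if not eff:
        return 1.0
    tf = Counter(table_tokens)
    it = [tf[w] for w in eff]              # incoming-table vector
    gs = [weights[w][0] for w in eff]      # genuine-class vector
    ns = [weights[w][1] for w in eff]      # non-genuine-class vector
    norm = lambda v: math.sqrt(sum(x * x for x in v)) or 1.0
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    g_corr = dot(it, gs) / (norm(it) * norm(gs))
    n_corr = dot(it, ns) / (norm(it) * norm(ns))
    return g_corr / (n_corr or 1e-9)       # guard against a zero denominator
```

On a toy corpus where genuine tables contain "price" and non-genuine ones contain "menu", an incoming table full of "price" scores above 1 and one full of "menu" scores below 1.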
(1) Now that we have the features, we need to decide on the classifiers to use. We picked two popular classifiers to experiment with. The decision tree was selected because, as you might have noticed, our feature set is highly non-homogeneous: some features are integers, some are floating point numbers, and decision trees handle that quite well. Secondly, decision trees have been widely used and have proved to perform very well in that context. We also wanted to experiment with Support Vector Machines because they have a strong theoretical foundation based on structural risk minimization, and they were shown to be the best performer in a recent text categorization test.
(1) The decision tree algorithm used in our study was implemented from the description by Haralick and Shapiro in the book ... It is similar to the widely used C4.5 algorithm: it uses a top-down greedy strategy; the decision rule at each internal node is thresholding one of the feature components; the selection of both the threshold and the feature component is based on an entropy-based purity measure; and finally, it uses maximum tree depth and minimum sample size as stopping criteria.
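The threshold selection at each node can be illustrated with a small sketch. This is the standard information-gain computation implied by "entropy-based purity measure", written for a single feature; it is not the authors' implementation, which also searches over feature components and applies the stopping criteria.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a label list."""
    n = len(labels)
    return -sum((labels.count(k) / n) * math.log2(labels.count(k) / n)
                for k in set(labels))

def best_split(values, labels):
    """For one feature, pick the threshold maximizing information gain:
    base entropy minus the size-weighted entropy of the two sides.
    Returns (threshold, gain)."""
    base = entropy(labels)
    best = (None, -1.0)
    for t in sorted(set(values))[:-1]:       # candidate thresholds between values
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        if base - remainder > best[1]:
            best = (t, base - remainder)
    return best
```

In a full tree builder, this search runs over every feature component at every node, and the (feature, threshold) pair with the highest gain becomes the node's decision rule.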
(1) One important characteristic of Support Vector Machines is that they are in some sense universal learners: in its basic form, an SVM implements a linear classifier, but many different kernels have been adopted which can be used to implement other types of classifiers, including ... The implementation we used is the ..., and we tested both the linear and the radial basis function kernels.
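For reference, the two kernels tested have these standard forms. The gamma width parameter below is the usual RBF convention; the value actually used in the experiments is not stated in these notes.

```python
import math

def linear_kernel(x, y):
    """K(x, y) = x . y  -- the basic linear SVM kernel."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)  -- the radial basis function
    kernel, which lets the SVM implement a non-linear decision boundary."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

The RBF kernel equals 1 only when the two feature vectors coincide and decays toward 0 as they move apart, which is what allows it to separate classes the linear kernel cannot.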
(1) Our goal here was to construct a large database that includes tables of as many different varieties as possible. At the same time, we also needed to ensure a significant number of genuine tables in the database. For this practical reason, we biased the data collection towards web pages that are more likely to contain genuine tables ... Even in this somewhat biased collection, genuine tables account for only about 15% of all leaf table elements.
(1) ... The results are pooled to compute the final performance measures. We use three standard measures. Recall: the percentage of true genuine tables that are classified as genuine. Precision: the percentage of classified genuine tables that are truly genuine. F: the average of P and R.
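The three measures can be written directly. Note that F here is the arithmetic mean of precision and recall, as stated in the notes, not the harmonic mean used in the more common F1 score.

```python
def evaluate(true_genuine, predicted_genuine):
    """true_genuine / predicted_genuine: sets of table identifiers.
    Returns (recall, precision, F), with F the average of P and R."""
    tp = len(true_genuine & predicted_genuine)   # correctly detected genuine tables
    recall = tp / len(true_genuine)              # fraction of genuine tables found
    precision = tp / len(predicted_genuine)      # fraction of detections that are right
    return recall, precision, (recall + precision) / 2
```

For example, finding 2 of 4 genuine tables while making 3 detections gives R = 0.5, P = 2/3, F = 7/12.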
(1.5) As we can see, the best results were achieved ..., although the addition of the word group feature contributed only a very small improvement. The reason, we think, is that genuine table and non-genuine table are very broad classes, each containing samples covering many different subjects, so a simple bag of words results in a feature that is too spread out and thus not very discriminative. One possible solution, to be explored further, is to use word categories instead of raw words for this feature.
(0.5) As we can see, the decision tree and the SVM with radial basis function kernel give comparable performance in terms of the F measure, and both are much better than the SVM with linear kernel. (These are the measures when the highest F measure is achieved within each setting.) Also, in this particular setting, the SVM with radial basis function kernel gives similar precision and recall, while the decision tree achieves better precision but worse recall.
(1) Here is the final set of results, comparing the machine learning system to a rule based system. This shows that the rule based system, even after some manual adjustment, does not adapt well to a new and more diverse database, while the machine learning based system is indeed more versatile.
(0.3) Now let me show you some examples. Here are two examples where the table is correctly classified.
Example 1: length inconsistency; many hyperlinks (unusual in genuine tables). Example 2: whole body coded using one <tr>, with <p> used to separate rows (a tag we failed to consider in our pseudo-rendering module).
(0.5) Sample 1: row consistency. In fact, one could argue that this is a genuine table, with implicit row headings of Title, Name, Affiliation, Phone number and Email address. This example demonstrates one inherent difficulty in this work: whether or not a table is genuine is sometimes ambiguous even to humans. Sample 2: reasonable consistency along the columns. This is a case where the current features are not adequate. To correctly classify a table like this, one would need more advanced language analysis to discover that there is no logical relation or semantic consistency among the cells, and that, of course, is a very difficult task.
(0.5) We feel that the current detection algorithm performs well enough for us to proceed to investigate the next step, interpretation: extracting the title and headings, and eventually, complete recovery of the logical structure of a table.