A Data Model and Query Language for XML
With XML poised to become the de facto language for representing internet data, accessing and querying this data has become increasingly important. To query XML documents efficiently, we developed the QuiXote query processing system. The first step in designing a query processing system is to define an appropriate data model. QuiXote uses the QNX data model that captures the entire DTD semantics, as well as the structure, content and link information present in an XML document. For querying, QuiXote uses the QNX query language, which is a DTD-based query language for XML.
The QNX Data Model
Various data models have been proposed for XML. The semistructured data model as used in Lore [MAG97] and XML-QL [DFF98] is used for data with no schema. Therefore it is less useful when the XML document has a schema (DTD) associated with it. This model also cannot capture the order among child nodes. Using the relational data model for XML has been studied in [STH99, DFS99]. In [STH99], the DTD is used to obtain the relational schema. But the relational schema can capture only a subset of the DTD semantics. Capturing only a portion of the schema semantics will not result in incorrect query processing. But if updates are to be supported, the data model should capture the entire DTD semantics.
Considering the unsuitability of the existing data models, we define the QNX data model for XML repositories. The QNX data model views an XML repository as a set of <schema, setOfDocuments> pairs. In other words, we partition the set of documents in the repository based on their schema. The schema is viewed in this data model as a graph rooted at the DOCTYPE element. The XML document itself is viewed as a graph rooted at the document root element. This captures the intra-document linking in the document. QNX uses "edges" from a node in one graph to a node in another to capture the inter-document linking.
QuiXote uses the QNX data model for viewing a repository, where each document is viewed as a graph. But this does not require that the documents be stored as graphs. QuiXote stores the XML documents as a tree. Storing an XML document as a tree has many advantages in addition to answering queries on parent-child relationships efficiently - (a) Trees are much easier to handle than graphs. For example, the complexity of computing structural summaries (dataguides) as mentioned in Lore [MAG97] can be exponential for graphs, whereas for trees, such summaries can be computed in linear time. (b) Most storage and compression schemes for XML, such as Millau [GNS99] and PDOM [PDOM] store XML documents as trees. Therefore by storing the XML document as trees, QuiXote maintains the flexibility of using any tree storage and load model. (Presently it uses Millau storage and load model).
But storing XML documents as trees is not very efficient for answering queries on links. This is because queries on intra-document and inter-document links are supported using joins. QuiXote relies on link indexing to improve the efficiency of such queries.
The QNX Query Language
The QNX query language is based on the QNX data model. It supports queries on structure, content as well as link information present in XML documents. The query languages for XML either use path expressions (XQL [RLS98] and Lore [MAG97]) or tree structures (XML-QL [DFF98]). A tree can always be decomposed into a set of paths. Therefore a query language that uses tree structures has the same expressive power as one that supports multiple path expressions. An important factor to consider in designing a query language is its ease of use. Queries that do not specify conditions on siblings typically tend to be shorter if path expressions are used. To specify conditions on siblings, tree structures are easier. Based on our previous experience in designing a pattern match and replace language for XML (PatML [SLR99]) and in using QuiXote to perform mining on XML repositories, we decided that QNX shall use tree-structures.
Conclusions and Future Work
The QNX data model provides the capability to query on the structure, content and link information present in an XML document. Storing the document as trees allows us to perform most document processsing in linear time. It also improves the efficiency of processing queries on parent-child relationships. But storing the document as trees decreases the efficiency of answering queries on link relationships. We would like to investigate various index techniques for increasing the efficiency of these queries.
Persistent DOM Model for XML, GMD-IPSI, Last accessed Jan, 2000, http://xml.darmstadt.gmd.de/xql/xql-examples.html
Deutsch. A, Fernandez. M, Florescu. D, Levy. A, Suciu. D, XML-QL: A query language for XML, Last accessed Jan, 2000
Deutsch. A, Fernandez. M, Suciu. D, Storing Semistructured DATA with STORED, SIGMOD Conference, Philadelphia,
Pennsylvania, June 1-3, 1999, pp. 431-442.
Girardot. M, Sundaresan. N, Millau: an Encoding Format for Efficient Representation and Exchange of XML Documents over
the WWW, 9th International World Wide Web Conference, Amsterdam, Netherlands
McHugh. J, Abiteboul. S, Goldman. R, Quass. D, Widom. J, Lore : A Database Management System for Semistructured Data,
SIGMOD Record, September 1997, 26(3):54-66
Robie. J, Lapp. J, Schach. D, XML Query Language (XQL), Last Accessed Jan, 2000,
Sundaresan. N, Lee. S, Rollins. S, PatML: A Pattern Match/Replace Language for XML documents in XML, IBM Research
Technical Report, January 1999.
Shanmughasundaram. J, Tufte. K, He. G, Zhang. C, DeWitt. D, Naughton. J, Relational Databases for Querying XML Documents:
Limitations and Opportunities, Proceedings of the 25th VLDB Conference, Edinburg, Scotland, 1999.