»XPath Evaluation on XML Streams« Project

We aim at providing techniques for efficient XML query evaluation on unbounded XML data streams.

Motivation

As XML data becomes a de facto standard on the Web, querying XML data is becoming an issue of great importance, a real research challenge. Several solutions have been proposed in the last years to this challenge, described in various working drafts of W3C or research papers. As a core of this common effort has been identified a navigational approach for information localization in XML data, comprised in a powerful language called XPath [1].
Initially, speaking about XML data was not so different as speaking about other non-XML Web documents, where the size does not play an important role. It was considered straightforward to get all the XML data into memory and then to query it. Approaches like DOM [2] stands for this. Nowadays, there are available huge XML repositories, where size matters. Even more, unbounded XML streams gain more importance due to their practical applications from publish-subscribe systems to data integration over the Web. As quickly realised by the XML community, the random access to XML data precludes efficient query processing with no outstanding gain, and sometimes it makes it even impossible.
Stream-based processing comes as an attempt to overcome pitfalls of the traditional DOM-based processing:

Progressive processing, i.e. stream-based processing generating partial results as soon as they are available, even before the full input is completely read, gives rise to a more efficient evaluation in certain contexts:

XPath Rewriter

Abstract: The XPath language proposes constructs for back and forth navigation in the XML document tree. Clearly, the backward navigation precludes efficient evaluation of XPath expressions on XML data streams, as the evaluation of backward navigation requires to buffer already encountered stream fragments. Therefore, rewriting rules for XPath expressions are proposed, that can rewrite expressions involving backward navigation into forward-only expressions.

Publications: [OMFB02], [OMFB01].

Time: 01.07.2001 - 01.03.2002.

Streamed XPath Evaluator

Abstract: A streamed and progressive evaluator (SPEX) of an XPath fragment against XML streams has been defined.

Publications: [OFB04], [BFO03], [OFB03], [OKB02], [K02].

Time: 01.10.2001 - running.

Optimizing Multiple Queries against XML Streams

Abstract: The root of this activity stems from the observation that several XPath queries can be translated into the same SPEX network. This activity investigates first the problem of how to find a cost-optimal multiple query plan (SPEX network) that allows the simultaneous evaluation of multiple queries (tens of thousands of queries of average size 10) against the same XML stream. Several heuristics are proposed for detecting common parts of query plans for different queries and then for sharing them in the multiple query plan. An extensive experimental evaluation shows that sharing arbitrary operators under realistic cost functions results in multiple query plans that have consistently lower cost for reasonable sets of queries than multiple query plans where only common prefixes are considered. In most cases, the relative improvement is higher than 50%. Although the time for generating such multiple query plans is higher than for multiple query plans where only common prefixes are shared, the increase in time is within an acceptable margin.

Publications: [F03].

Time: 01.02.2003 - 31.07.2003.

SPEX Viewer: A Graphical User Interface for SPEX

Abstract: A graphical user interface for SPEX is added to the project. In this way, the user can see at every incoming message from the stream the processing status in terms of current states, stack configurations, and result candidates. The graphical interface will be available also on the Web.

Publications: [S03].

Time: 01.10.2002 - 8.02.2003.

Approximate Streamed Evaluation of XPath under Memory Constraints

Abstract: The SPEX approach can not guarantee bounded memory usage for all supported queries. The goal of this activity is to investigate the cases that can be safely treated and the ones that need unbounded memory, independent of the XML stream. Strategies for coping with the latter situation are to be investigated.

Publications: [DS03].

Time: 01.11.2002 - running.

Well-Ordered XPath Queries

Abstract: The evaluation of XPath queries against XML streams can require in general memory linear in the size of the input XML stream [SPEX]. This happens in cases where potential answers to queries have to be buffered until the decision regarding their appartenance to the result can be met. However, such a time-limited buffering is not always possible, especially when the XML stream is unbounded or infinite. In such cases, the evaluation of queries could require unbounded (or even infinite) memory. Several approaches are proposed to ameliorate this situation, mainly focusing on enforcing hard memory constraints for the query evaluation, e.g., [SAQL], at the price of retrieving approximate instead of exact answers. It would be interesting to identify classes of XPath queries, for which there can be always inferred a bound on the required memory for query evaluation. More precisely, such classes of queries should be syntactically characterizable and memory bounds should be inferrable as functions only of the input query, or also of the input stream. In this sense, some classes can be described independent of the structure of the stream, or dependent on the stream schema. Well-ordered XPath Queries: As a prime target towards describing such classes, one can envisage as a first step the characterization of XPath queries, the evaluation of which does not require at all buffering of potential answers. Such a query property can be detected, e.g., by checking whether all constraints (or predicates) of a potential answer are fulfilled along the stream before encountering that potential answer, i.e., before fully evaluating the head path that leads to the answers. However, a statical analysis of the queries, i.e., without the inspection of the stream and before the query evaluation, could be sufficient in some cases to detect such a query property. As early investigation of the supervisors shows, such a property, called ad-hoc well-orderedness, could be detected statically for some classes of queries. The intuition behind ther property name well-orderedness resides in the informal observation that for a well-ordered query its predicates are always evaluated before before its head path, thus there exists an order of the evaluation of subqueries that ensures the buffer-freeness of the evaluation of the whole query.

Time: 01.09.2003 - running.

References


This page: http://www.pms.informatik.uni-muenchen.de/forschung/xpath-eval.html        (validation)
Last modified: Thu Mar 31 20:30:26 CEST 2005