Motivation
As XML data becomes a de facto standard on the Web, querying XML data is becoming an
issue of great importance, a real research challenge. Several solutions have been proposed
in the last years to this challenge, described in various working drafts of W3C or research
papers. As a core of this common effort has been identified a navigational approach for
information localization in XML data, comprised in a powerful language called XPath [1].
Initially, speaking about XML data was not so different as
speaking about other non-XML Web documents, where the size does not play an important
role. It was considered straightforward to get all the XML data into memory and then
to query it. Approaches like DOM [2] stands for this. Nowadays, there
are available huge XML repositories, where size matters. Even more, unbounded XML streams
gain more importance due to their practical applications from publish-subscribe
systems to data integration over the Web. As quickly realised by the XML community,
the random access to XML data precludes efficient query processing with no outstanding gain,
and sometimes it makes it even impossible.
Stream-based processing comes as an
attempt to overcome pitfalls of the traditional DOM-based processing:
- documents can be too big to be stored in main memory, as in DOM. Such documents are often encountered in natural language processing [3], biological [4] and astronomical data [5];
- unbounded XML streams that are automatically generated (e.g. news) impose new querying approaches to be considered, where the memory usage does really matter;
Progressive processing, i.e. stream-based processing generating partial results as soon as they are available, even before the full input is completely read, gives rise to a more efficient evaluation in certain contexts:
- Selective dissemination of information(SDI), where documents are filtered before being distributed to the subscribers [6, 7];
- Data integration, where the input of slow sources can be processed before the full data is retrieved over the internet [8];
- Pipelined processing, where the input is sent through a chain of processors each of which taking the output of the preceding processor as input. A good example of pipelined processing is Cocoon [9].
XPath Rewriter
Abstract: The XPath language proposes constructs for back and forth
navigation in the XML document tree. Clearly, the backward navigation precludes efficient
evaluation of XPath expressions on XML data streams, as the evaluation of backward
navigation requires to buffer already encountered stream fragments. Therefore, rewriting
rules for XPath expressions are proposed, that can rewrite expressions involving backward
navigation into forward-only expressions.
Publications:
[OMFB02],
[OMFB01].
Time: 01.07.2001 - 01.03.2002.
Streamed XPath Evaluator
Abstract: A streamed and progressive evaluator (SPEX) of an XPath fragment against XML streams has been defined.
- SPEX is an efficient single-pass evaluator based on networks of independent deterministic pushdown transducers. Thus, it is especially suitable for implementation on devices with low-memory and simple logic as used, e.g., in mobile computing.
- The current XPath fragment processed covers all forward axes of XPath, set operations, and restricted predicates. The reverse axes can also be treated, as direct result of previous work on XPath rewriting. As predicates, are supported (arbitrarily nested) structural predicates, value comparisons and node-identity joins. The extension of the supported language to accept also variables does not raise problems.
- SPEX has a polynomial combined complexity (in the size of the stream and the query).
- A Java prototype has been implemented and shows the SPEX efficiency when comparing SPEX to current widely accepted XPath implementations, like Xalan.
Publications:
[OFB04],
[BFO03],
[OFB03],
[OKB02],
[K02].
Time:
01.10.2001 - running.
Optimizing Multiple Queries against XML Streams
Abstract: The root of this activity stems from the observation that several
XPath queries can be translated into the same SPEX network.
This activity investigates first the problem of how to find a cost-optimal multiple query
plan (SPEX network) that allows the simultaneous evaluation of multiple queries (tens of
thousands of queries of average size 10) against the same XML stream. Several heuristics are
proposed for detecting common parts of query plans for different queries and then for
sharing them in the multiple query plan. An extensive experimental evaluation shows that
sharing arbitrary operators under realistic cost functions results in multiple query plans
that have consistently lower cost for reasonable sets of queries than multiple query plans
where only common prefixes are considered. In most cases, the relative improvement is higher
than 50%. Although the time for generating such multiple query plans is higher than for
multiple query plans where only common prefixes are shared, the increase in time is within
an acceptable margin.
Publications: [F03].
Time: 01.02.2003 - 31.07.2003.
SPEX Viewer: A Graphical User Interface for SPEX
Abstract: A graphical user interface for SPEX is added to the project. In
this way, the user can see at every incoming message from the stream the processing status
in terms of current states, stack configurations, and result candidates. The graphical
interface will be available also on the Web.
Publications:
[S03].
Time: 01.10.2002 - 8.02.2003.
Approximate Streamed Evaluation of XPath under Memory Constraints
Abstract: The SPEX approach can not guarantee bounded memory usage for all
supported queries. The goal of this activity is to investigate the cases that can be safely
treated and the ones that need unbounded memory, independent of the XML stream. Strategies
for coping with the latter situation are to be investigated.
Publications: [DS03].
Time: 01.11.2002 - running.
Well-Ordered XPath Queries
Abstract:
The evaluation of XPath queries against XML streams can require in general memory linear in
the size of the input XML stream [SPEX]. This happens in
cases where potential answers to queries have to be buffered until the decision regarding
their appartenance to the result can be met. However, such a time-limited buffering is not
always possible, especially when the XML stream is unbounded or infinite. In such cases, the
evaluation of queries could require unbounded (or even infinite) memory. Several approaches
are proposed to ameliorate this situation, mainly focusing on enforcing hard memory
constraints for the query evaluation, e.g., [SAQL], at
the price of retrieving approximate instead of exact answers. It would be interesting to
identify classes of XPath queries, for which there can be always inferred a bound on the
required memory for query evaluation. More precisely, such classes of queries should be
syntactically characterizable and memory bounds should be inferrable as functions only of the
input query, or also of the input stream. In this sense, some classes can be described
independent of the structure of the stream, or dependent on the stream schema.
Well-ordered XPath Queries: As a prime target towards describing such
classes, one can envisage as a first step the characterization of XPath queries, the
evaluation of which does not require at all buffering of potential answers. Such a query
property can be detected, e.g., by checking whether all constraints (or predicates) of a
potential answer are fulfilled along the stream before encountering that potential answer,
i.e., before fully evaluating the head path that leads to the answers. However, a statical
analysis of the queries, i.e., without the inspection of the stream and before the query
evaluation, could be sufficient in some cases to detect such a query property. As early
investigation of the supervisors shows, such a property, called ad-hoc well-orderedness, could
be detected statically for some classes of queries. The intuition behind ther property name
well-orderedness resides in the informal observation that for a well-ordered query
its predicates are always evaluated before before its head path, thus there exists an
order of the evaluation of subqueries that ensures the buffer-freeness of the evaluation of
the whole query.
Time: 01.09.2003 - running.
Theses
-
[DS03]
Approximate Streamed Evaluation of XPath under Memory Constraints
. Dominik Schwald. October 2003.
Project thesis, University of Munich. -
[F03]
Optimizing Multiple Queries against XML Streams
. Tim Furche. July 2003.
Diploma thesis, University of Munich.
[.ps.gz, .pdf, .html] -
[S03]
SPEX Viewer: A Graphical User Interface for SPEX
. Markus Spannagel. February 2003.
Project thesis, University of Munich. - [K02]
Towards a Streamed XPath
Evaluation. Tobias Kiesling. June 2002.
Diploma thesis, University of Munich.
Presentations
-
Query Optimization on XML Streams. Tim Furche, June 2003
Oberseminar "Knowledge Representation and Markup Languages", University of Munich.
[.ps.gz, 318KB / .pdf, 487KB] -
Query Merging for XML Streams. Tim Furche, December 2002
Oberseminar "Knowledge Representation and Markup Languages", University of Munich.
[.ps.gz, 692KB / .pdf, 763KB] - SPEX: A1 Poster in ps.gz and pdf.
-
Querying XML Streams using Pushdown
Transducers. Tobias Kiesling and Dan Olteanu.
Oberseminar, PMS/LMU, Munich, April 2002. -
Rules for Efficient XPath Evaluation. Dan Olteanu.
Theorie und Anwendung semistrukturierter Daten, Herrshing, February 2002. -
Rules for Efficient XPath Evaluation. Dan
Olteanu.
Rule Markup Techniques, Dagstuhl Seminar no. 02061, February 2002. -
Symmetry in XPath.
Dan Olteanu, Holger Meuss, Tim Furche.
Oberseminar, PMS/LMU, Munich, October 2001. -
Enhancing XPath with Schema Information. Dan Olteanu.
Oberseminar, PMS/LMU, Munich, July 2001.
References
- [1] XML Path Language (XPath) Version 1.0. W3C Recommendation. J. Clark, S. DeRose. http://www.w3.org/TR/xpath, 1999.
- [2] Document Object Model (DOM). W3C DOM Working Group. http://www.w3.org/DOM/, 2002.
- [3] XCES: An XML-based standard for linguistic corpora. N. Ide, P. Bonhomme, L. Romary. 2nd Annual Conf. on Language Processing Resources and Evaluation, 2000.
- [4] Modeling of Biological Data. P. Kroeger. University of Munich, 2001.
- [5] NASA. Astronomical Data Center, http://adc.gsfc.nasa.gov/.
- [6] Efficient Filtering of XML Documents with XPath Expressions. C. Chan, P. Felber, M. Garofalakis, R. Rastogi. In Proc. of ICDE 2002.
- [7] Efficient Filtering of XML Documents for Selective Dissemination of Information. M. Altinel, M. Franklin. In Proc. of VLDB 2000.
- [8] Efficient Evaluation of Regular Path Expressions on Streaming XML Data. A. Levy and Z. Ives and D. Weld. http://data.cs.washington.edu/integration/x-scan/, University of Washington, 2000.
- [9] Cocoon 2.0: XML publishing framework. Apache Project. http://xml.apache.org/cocoon/index.html, 2002.
