MQSPEX

Optimizing Multiple Queries against XML Streams

Project description

Processing and querying streams, XML streams in particular, has recently become a widely recognized area of interest both in research and in industry. In the context of the “XPath Evaluation on XML Streams” Project [BFMO01], a novel evaluation model for querying XML streams based on simple deterministic pushdown transducers is proposed: SPEX (streamed and progressive evaluation of XPath queries against XML streams). These transducers, corresponding to the operators used in the underlying query, can be arranged in a very flexible manner into an evaluation network, specifying the flow of the data from the stream through the operators.

In contrast to traditional query evaluation for databases, where multiple queries against the same data can be evaluated sequentially, for a streamed environment only the simultaneous execution of multiple queries is feasible, as the sequential evaluation requires multiple passes over the stream.

This project aims at extending the basic SPEX engine with support for multiple queries. [F03] presents an overview of techniques for optimizing multiple queries posed against a stream of XML data. Building upon the SPEX query engine [K02, OKB03], the problem how to find a cost-optimal query plan that allows the simultaneous evaluation of multiple queries against the same stream is presented and shown to be not only hard to solve but also hard to approximate, if arbitrary parts, and not only common prefixes as in previous approaches, can be shared among query plans. Several heuristics are investigated and compared, in particular with respect to their complexity. Furthermore, it is shown how to extend the SPEX query engine into MQSPEX (multi-query aware SPEX) to support such query plans for multiple queries. This extension proves to be both natural and efficient. An extensive experimental evaluation shows that sharing arbitrary operators under a realistic cost function results in query plans that have consistently lower cost for reasonable sets of queries than query plans where only common prefixes are considered. In most cases, the relative improvement is higher than 50%. Although the time for generating such query plans is higher than for query plans where only common prefixes are shared, the increase in time is within an acceptable margin.

Further work in this project is focused on improving the ideas presented [F03] and providing an easily accessible prototype of MQSPEX. Possible directions for improvement are described below under future work. In the following, after an extended motivation describing exciting areas of application for the proposed techniques, the important contributions of this project are identified. Striking results of the extensive experimental evaluation given in [F03] are highlighted as well as future directions for reasearch in this project.

Motivation

Processing of data streams or sequences of blocks of data (usually called elements of the stream) has triggered rising interest, both in research [CCCC02, Y03] and in industry [C00, MB02]. Stream processing differs from conventional methods of data processing in main memory or on data bases in the requirement to process data immediately on its arrival in the stream instead of creating an appropriate data structure that is on which the actual processing is performed afterwards. In particular, if the data to be processed changes fast, only small fragments of the data have to be processed repeatedly, the data arrives at a high rate, or the amount of data is too large to be efficiently stored and processed, stream processing provides clear advantages over traditional methods based on in-memory data structures or databases. Indeed, in most applications for stream processing at least one of these characteristics can be observed, including some of the most exciting areas of application for streams:

All these areas of applications share an inherent heterogeneity of information sources in respect to the data delivered, the parameters of the provided service, or simply the administrative responsibility. For example, monitors on devices from various vendors in network analysis might generate monitoring data with differing resolution, precision and in varying intervals, the various news agencies used as sources for syndication as well as dissemination will likely focus on different stories or aspects of the same story. These heterogeneous information sources require some agreed upon means for information encoding as provided by XML [http://www.w3.org/XML/] dialects for the various areas such as NITF (News Industry Text Format, http://www.nitf.org) for news, MPEG-7 for meta-data on multimedia streams, or SensorML [B02] for sensor information and configuration. Although in some of these applications, most notably for sensor data, one commonly expects rather flat data consisting in simple attribute-value pairs—and even monitoring data sometimes requires more sophisticated modeling constructs, e.g., when representing the fused information from entire sensor networks or partial results of sensor analysis [BR02]— others such as syndication of news or meta-data processing with MPEG-7 require the richer hierarchical relations provided by XML.

Summing up, a clear need for the efficient processing of XML streams can be identified in all the above application areas. In particular, all the described applications require some capability to query the incoming stream: Sensor networks have to correlate and monitor incoming data based on specifications of interesting events (e.g., “send an alert, if the value of sensor A is above a certain limit for more than 5 minutes”). Publish-subscribe systems route data based on the subscriptions of their users, essentially filtering the stream of data with large numbers of queries, such as “send all publications containing a certain keyword both in the title and the first paragraph”. Syndication systems correlate information from different sources and often, again based on user profiles, select only potentially interesting data. Finally multimedia streams have to be filtered and routed based on queries over the meta-data, it is even possible to select only parts of a multimedia stream, e.g. “return only those scenes in the movie where a certain speaker uses a certain word”.

As noted above, one of the advantages of stream processing is the progressive delivery of result in a single pass over the stream. One-pass processing is required in particular on large or unbounded streams or if the results have to be delivered continuously even before all data has arrived, as it is common in monitoring and event notification systems. This manner of processing imposes three important challenges to a query engine:

  1. It is not feasible to look back in a stream (except possibly within a certain small window). Hence, queries containing such backward navigation either have to be rewritten, as proposed in [OMFB02], or disallowed.
  2. As a stream is always processed sequentially, traditional optimization techniques, such as database indices, that are tailored to minimize the amount of access to data and the result of intermediary results, are not applicable to streams. Here, the optimization objective is to minimize the number of operations per element in the stream rather than the number of elements visited. This shift in the optimization objective is reflected in the notion of a stream index employed in the XML Toolkit [AGOR02] that does not provide efficient access paths to data relevant for a query but rather allows to skip elements not relevant for a query reducing the number of operations for each such element to one.
  3. If several queries, such as monitoring conditions or subscriber profiles, have to be evaluated against the same stream, it is not feasible to evaluate the queries sequentially as in traditional databases but they have to be processed simultaneously.

Contributions

This project focuses on the third observation: The requirement to execute multiple queries concurrently against the same stream combined with the second observation, that the number of operations per element is crucial for fast processing of streams, there is an evident need for novel methods to optimize multiple queries for a single processing against a stream. Indeed, all proposals [AF00, CGR02, DAFZ02] of query engines for publish-subscribe systems are tailored to process large amounts of queries, but due to their application area restricted to the filtering of small self-contained documents from a stream of such documents . Among the general query engines recently proposed [GMOS03, GS03, PC03, BCGR03, OKB03] only the [GMOS03, GS03] consider the optimization of multiple queries. All these approaches have in common, that the optimization is based on common prefixes in the queries (for an in-depth comparison see [F03]). The approach developed as part of this project improves on these by considering not only prefixes but rather any kind of shared subparts techniques for multi-query optimization on XML streams in several points:

Some of these contributions are highlighted in the following to give an impression of the idea.

Query plans for XML streams

The core objective of traditional query optimization, e.g. for relational databases, is to find for a given query a set of operators together with an order in which these operators are to be applied for evaluating that query. Such a set of operators together with the specification of the data flow is refered to as a query plan. Often, one distinguishes between logical and physical query plans: For relational databases, logical query plans are usually based on the (operators and rules of the) relational algebra, whereas physical query plans entail also operators that are concerned about the actual access to the data, e.g., operators for index access, table scans, various join variants, etc. The by-standing image shows the usual stages of the optimization process in a query optimizer: The query (specified in some supported query language) is parsed and the parse tree is used to generate the initial logical query plan that in turn is optimized according to the rules of the logical algebra. This optimized version of the logical query plan is the basis for selecting the prefered physical plan which in turn is used for the actual query evaluation.

[F03] proposes a logical query algebra for querying XML streams with relation operators similar to the XPath axes: Two elements in the stream are related, e.g., by the child operator, if one is the child of the other. Furthermore, elements can be queried by their properties, such as the label of an element or the text contained in an element.

The following two images show query plans for different but similar queries: In the first figure, all elements from the stream (indicated by the access operator in) are first processed by the child relation operator, that retains only elements that are child of some other element. From these elements only those that have also an “a” label are retained by the next operator, etc. Where the first figure selects all “c” descendants of following-siblings of an “a”, if that “a” is a child of some element and has itself a “b” child with content “v”, the second figure selects all “c” descendants of children of such “a”s.

One of the most important techniques for optimizing queries against streams, in particular XML streams, is the sharing of operators. In particular, if several queries shall be evaluated simultaneously, operator sharing can result in considerable increase in processing time. For the two queries from above, the following two pictures show possible common query plans, where the first one only shares operators in a prefix of the query plan, whereas the second one considers all operators for sharing.

MCSP Problem

The problem how to compute the minimal common query plans for several queries (together with a evaluation strategy for each query represented as a query plan) is formally defined as an NP optimization problem MCSP (minimum common super-plan) in [F03]. Clearly, the difficulty of computing a common query plan depends highly on the kind of graphs that are valid query plans under a certain evaluation model. The cost estimation for a query plan depends on the evaluation model as well. It can be shown, that the general case (where no assumptions about the evaluation model can be made) is NP-hard and NPO PB-complete, i.e., can not be approximated within nε for any ε > 0. It is an open issue, whether for certain (interesting) evaluation models better complexity and approximability results can be obtained.

A large set of heuristics of the approximating the MCSP problem are proposed and compared in [F03]. Two classes of such heuristic algorithms are identified based on differing assumptions about the underlying evaluation model: The first set assumes that cost estimation and validity check for query plans can be extended to partial solutions of the MCSP, the second set assumes that there is some algorithm for transforming one common query plan for multiple queries into another common query plan. Theoretical analysis of the algorithms reveals, that the first assumptions allows for slightly more efficient algorithms than the second one, but might be harder to fullfil.

This theoretical result is uphold in the experimental evaluation (see below).

Multi-query SPEX

For the experimental evaluation, the SPEX evaluation engine has been extended to support the evaluation of multiple queries according to a given common query plan (as computed by one of the heuristics mentioned above).

The SPEX evaluation engine is based on a network of simple transducers each implementing a certain operator from the query plans shown above. These networks resemble very closely the shape of a query plan with a determination network added at the sinks that shadows the structure of the query plan. This determination network is used for “determining”, whether all predicates are fullfilled: A predicate is indicated by the predicate operator [ ] and has the semantics, that each of the sinks after that operator has to yield result at least once for the same element selected by the predicate operator. This test is performed by the determination network as shown in the following figure: Only if both the “c” and the “d” label operator yield result for the same “b” the determination operator ?2 accepts that result. For more details, please refer to [F03].

The extension of the SPEX evaluation engine to multi-query plans proves to be very natural and easy: Consider the multi-query plan shown in the following figure. In this query plan the two queries 1 and 2 share a common path in the middle of the query plan.

In such a query plan, a new operator, referred to as shared path begin operator and depicted as a box labeled S, is introduced at the point where the shared path starts. When a selected element flows through this operator from one of the incoming edges it is annotated with a condition. It is furthermore recorded for which of the incoming edges (or queries), a selected element is encountered, since it is very possible (in this case actually guaranteed) that the element is selected only on one of the incoming edges. Here, an element is selected either by query 1, if it is an “a” element, by query 2, if it is an “b” element, or by none of them.

In the determination network at the end of the query plan, the queries are separated again if necessary and for each query a determination operator is used that determines a condition only, if a selected element has been encountered for that query by the corresponding shared path begin operator when the condition was created.

Interestingly, this extension to the SPEX network proves to be very efficient: The overhead introduced by sharing a path is not larger than the cost of a predicate operator and occurs only at the beginning of the path. The more costly the shared path is (e.g., if it is long or entails expensive operators), the lower is the overhead of sharing per operator.

Experimental results

A prototype implementation of the proposed query plans and algorithms for solving the MCSP in Java [http://java.sun.com] based on a graph library specifically optimized for efficient iteration over the incident edges of a vertex, efficient implementations of the most important graph operations for a query plan are provided. The following tests have all been performed using Sun Hotspot JRE 1.4.1 under Linux 2.2 running on a AMD Athlon with 1,3 GHz and 500 MB of main memory.

Several workloads of 10000 tree-shaped queries each simulating common application scenarios have been used for these tests: Two of the workloads are using queries generated by an automated query generator that are all confirming to a given DTD (in the respect that the queries do not violate any of the constraints specified in the DTD). In the following, only two workloads consisting in queries generated from a rather small DTD allowing only slight variations (COURSES-workload), from a complex, highly recursive DTD (NITF-workload), and a third one with entirely random queries (RANDOM-workload) are considered. The following figures give an idea of the kind of queries contained in one of the workloads.

The following figures show the resulting query plans for the first 2, 5, 10, and 50 queries from the COURSES-workload with the two most interesting heuristics: The first four figures show the result of the tree prefix merger, which consideres only prefixes of tree-shaped queries for sharing, the remaining four depict the outcome of the initial greedy merger considering also non-prefix nodes for sharing using a simple greedy heuristic. Both merger fall into the first class of heuristics as discussed above.


In the following, the tree prefix merger and the initial greedy merger are compared over all three workloads with two additional mergers for reference, the random merger that selects a random but valid merging and the plain merger that constructs a common query plan without sharing any operators. For all workloads and mergers, the cost of the resulting query plan (estimated by some cost function, cf. [F03]) and the time for computing the common query plan are depicted.

On the COURSES-workload, the initial greedy pair merger produces by far the best solution in a reasonable time. Note in particular, that the random merger (as on all workloads) performs far worse not only in cost but also in time. This is due to the fact, that the common query plan generated by the random merger increases much faster in size when more queries are considered than in the case of the initial greedy merger who tries to find possible sharings. Therefore, determining a valid solution becomes increasingly harder for the random merger.

For the NITF-workload, a similar picture is obtained, but due to the higher complexity of the DTD and therefore the increased diversity in queries, the time for both initial greedy and random merger increases considerably. Nevertheless, the initial greedy merger still offers the best trade-off between cost of generated solution and time for generation.

The RANDOM-workload is not so much tailored to a certain application area, but rather showing a theoretical worst-case, since in all practical applications a higher degree of similarity among the queries can reasonably be expected. As expected, neither of the mergers can improve the solution quality much in comparison to the plain merger. The random merger actually constructs a solution that is worse than when no operators are shared (this is possible, as any merging introduces some overhead as discussed above that is reflected by the cost function).

Summing up, the initial greedy merger clearly outperforms the tree prefix merger that is mimicing previous approaches for multi-query optimization on all realistic workloads w.r.t. the quality of the solution produced. Although, the time for generating a solution is slightly higher for the initial greedy merger, the increase is acceptable as the common query plan can be precomputed during query compilation.

Future Work

For the future, several open issues are worth investigating in the context of this project:

References


This page: http://www.pms.informatik.uni-muenchen.de/forschung/spex/mq.html
Author: Tim Furche, Dan Olteanu
Last modified: 2003/08/22
Valid XHTML 1.0! | Valid CSS! | Level Double-A conformance icon, W3C-WAI Web Content Accessibility Guidelines 1.0