The State of the UNION: Why no SPARQL Condition Should be Second Class

From korrekt.org

(Redirected from The State of the UNION)
Jump to: navigation, search
28 Apr 2011

 

While implementing RDF-based query answering for Semantic MediaWiki (SMW), I realized that the popular SPARQL query language for RDF has what I consider to be a major design flaw. Admittedly, I had been slightly too optimistic: surely, moving from SQL to SPARQL would be a piece of cake! Have we not had exactly the same RDF-ish data model in mind when designing SMW? And is SMW not a natural friend to all of its fellow Semantic Web technologies? Well, let's just say that the friendship has cooled down a bit in certain matters …

How could it come to this? The problem is «or», or rather the absence of it in SPARQL. Now the reader who knows SPARQL will wildly object: does not SPARQL have more than enough features to express disjunctions in queries? «More than enough» is indeed quite accurate, for when it comes to disjunctive queries, SPARQL has not one unified approach but two different approaches: UNION and ||. Unfortunately, these two ways of saying «or» have a different style of interpretation: one only works with disjunctions of graph patterns, the other works only with disjunctions of filter expressions.

Contents

Background: SPARQL for Beginners

Readers who have not worked much with SPARQL might be puzzled by this distinction; this is actually part of the problem. But it is quickly explained (readers who need more of an intro to RDF and SPARQL may want to get a textbook). At its heart, SPARQL is an RDF query language that is based on pattern matching. Query patterns look just like data, but with variables in place of some concrete values. An answer to the query then instantiates all variables in such a way that the instantiated query matches a snippet of the stored data. For example, the next query finds all things located in the USA:

SELECT ?x WHERE { ?x locatedIn USA }

Easy-peasy, and really the idea behind most query languages. Alas, patterns are not enough in practice. Sometimes one wants to restrict results further, e.g. to compare values with each other. SPARQL uses filters for this:

SELECT ?x WHERE {
  ?x population ?number
  FILTER( ?number>1000000 )
}

searches for things with more than one million inhabitants. This makes a lot of sense when you think about it. After all, the RDF language that patterns are based on does not have any features like this, so we need some new syntax. And there are many filters: numeric comparisons, string-pattern searches, type checks, you name it.

Two Disjunctions, Both Alike in Dignity

SPARQL allows us to express disjunctions between graph patterns using UNION:

SELECT ?x WHERE {
 { ?x locatedIn USA }
 UNION
 { ?x locatedIn France }
}

finds all things that are either in France or in the USA. SPARQL also can express disjunctions in filter conditions using ||:

SELECT ?x WHERE {
  ?x population ?number
  FILTER( ?number>1000000 || ?x=Luxembourg )
}

finds all things that have over a million inhabitants, or that are Luxembourg. Note how a filter can express equality. So why not get rid of the equality, moving it to the pattern, and use UNION as above?

SELECT ?x WHERE {
 { ?x population ?number FILTER( ?number>1000000 ) }
 UNION
 { Luxembourg population ?number }
}

This is not what we wanted: if we do it like this, Luxembourg can no longer appear as a variable binding for ?x. Conclusion: even if "=" filters can be expressed in patterns, we sometimes need to use filters for getting the right results. Slowly, we begin to realize that there will be trouble.

A Class Society of Query Conditions

What will happen if we have a query condition that disjunctively combines a filter and a pattern? Consider the task:

«Find all ?x locatedIn ?y, where ?y is the USA or a state of the USA.»

A natural way for a users to enter this into a system is to say «I am looking for ?x that are locatedIn ?y, where ?y has one of the following properties:

  • it is equal to the USA, or
  • it is a stateOf the USA.»

So what we get is a disjunctive property about ?y that does not involve ?x at all. So we would like to write this as:

SELECT ?x WHERE {
  ?x locatedIn ?y .
 { FILTER( ?y = USA )  UNION  ?y stateOf USA }
}

Unfortunately, this is not a valid query in SPARQL (or at least yields errors in 4Store and Virtuoso, and would not give the right results). The reason is that FILTER does not work as a pattern, only as a «second class» side condition to patterns. One could work around this by adding some pattern that effectively states "?y exists" to the first pattern in the UNION:

SELECT ?x WHERE {
  ?x locatedIn ?y .
 { { ?y ?p ?o FILTER( ?y = USA ) }  UNION  ?y stateOf USA }
}

Assuming that all things in my DB have some property ?p (e.g. rdf:type or rdfs:label), this would do the trick. But this is not a clean solution, and surely does not make anything easier for the database that will now have to search for values of ?p and ?o quite unnecessarily.

A Simpler Solution, and Why it is Wrong

Again, routine SPARQLers will protest. Isn't there a much more straightforward way of writing the query?

SELECT ?x WHERE {
 { ?x locatedIn USA }
 UNION
 { ?x locatedIn ?y . ?y stateOf USA }
}

Yes, this would work. But it does not solve our problem, just the example I gave. It only works since "=" is equivalent to identity for our purposes. If our FILTER condition would be different (e.g. find anything with "USA" in the URI string) then there is no such simple solution.

Moreover, even for = the above solution has a fundamental flaw. Assume that the query requires many things about ?y, not just one disjunction, e.g. the user also wants that «?y has at least 10 million inhabitants or is in the pacific». Contrived, maybe, but if you write a system like SMW you cannot exclude query types just because you had no plausible toy examples ready for them. We would like to write this as:

SELECT ?x WHERE {
 ?x locatedIn ?y
 { { FILTER( ?y = USA ) }
   UNION
   { ?y stateOf USA }
 }
 { { ?y population ?num FILTER( ?num > 10000000 ) }
   UNION
   { ?y locatedIn Pacific }
 }
}

Yes, we have two independent UNION here, that was the point of the example. Again, this is not a valid query, but this time we have a real problem expressing it without the «?y ?p ?o» hack used above. The knowledgeable reader can surely work out how to do it, but the result is quite a bit more complex:

SELECT ?x WHERE {
 { ?x locatedIn USA . USA population ?num FILTER( ?num > 10000000 ) }
UNION
 { ?x locatedIn USA . USA locatedIn Pacific }
UNION
 { ?x locatedIn ?y . ?y population ?num FILTER( ?num > 10000000 ) }
UNION 
 { ?x locatedIn ?y . ?y locatedIn Pacific }
}

Looking at this carefully, we see that the conditions of our individual disjunctions have multiplied: we have combined the conditions of both disjunctions in all possible ways. This is clearly an exponential transformation that will very quickly lead to very large queries. Not good.

Doing it the Hard (but Correct) Way

In fact, there is another way of combining filter and pattern disjunctions in SPARQL. It is not pretty, so let us go back to our simpler example without the additional disjunction. What we can write then is:

SELECT ?x WHERE {
  ?x locatedIn ?y .
 OPTIONAL { ?yaux stateOf USA }
 FILTER ( ?y = USA || ?y = ?yaux )
}

This needs to be explained. OPTIONAL is SPARQL's way of adding optional patterns to a query. So it will try to find a suitable match for our helper variable ?yaux, but it will not fail the query if there is none. So what we are looking for in the query is an ?x locatedIn ?y where ?y is either the USA, or ?y equals ?yaux which in turn must be a stateOf the USA. If the optional part is not found, then ?yaux is unbound and can never be equal to (bound) ?y.

So this beast of a query does the right thing. Now if you had more disjunctive conditions, you could add them to the OPTIONAL (with UNION) or to the FILTER (with ||). More conjunctive conditions can simply be added next to ?x locatedIn ?y. To conjunctively add another disjunction (like the construed example above), one would just add another OPTIONAL-FILTER group (using a new auxiliary variable). So we can easily see that this method allows for an easy to implement, modular, linear-time translation of conditions to SPARQL.

It also looks like something that any RDF store would have a very hard time to optimize since it completely hides the simplicity of the actual query, as the user would phrase it.

Happy or Tragic Ending?

The developer who wishes to use SPARQL for a wide range of complex queries now has three options:

  • implement the hack above (use alibi pattern conditions just to allow filters),
  • implement the cumbersome translation above (OPTIONAL-FILTER),
  • explain to her users which conditions are expressed as patterns and which as filters, and that one must not combine the two in disjunctions.

None of this seems to be fully satisfactory. So it remains in the hands of the current and future writers of the SPARQL specification to finally give filters the right to stand on their own in group graph patterns. This might turn out to be an easier task than it seems. The main change is that filters are no longer used as a post-processing mechanism, but like «special graph patterns» that can produce their own matches. I strongly expect that this is what most implementations do anyway since it would not be efficient to blindly retrieve all results and check the filter conditions only later. Concretely, one would need to make the following modifications:

  • change the formal semantics to treat FILTERs like patterns, not like post-processing features,
  • update query engines to support filters without pattern according to the new semantics; engines that have no optimization that works like this already can simply add a dummy pattern that matches all RDF elements for each of the variables in the filtered expression (inefficient but correct, and my above solutions would not be more efficient than this either).

It should be noted that this simplifies the algebra of SPARQL considerably. In particular, the Filter function would work like the BGP function, simply taking another type of input (a filter condition instead of a list of triples). All other functions would treat Filter and BGP in the same unified way since they would indeed return the same kinds of results. LeftJoin would no longer need an additional third argument. Interestingly, the change would still be largely backwards compatible: queries where all variables also occur in a pattern would lead to the same results as before. Only "unbound" variables in FILTERs would now be instantiated with all values (in the current graph) that they possibly match, instead of assuming the value "unbound" for such variables and "applying" the filter to this. This also means that SPARQL processors can support this change without changing the results of normal queries.

So what keeps us from doing it now? As always when abolishing outdated norms, many objections can be found, including «this is not in our working group charter», «we are already too late with the next update of the standard anyway», and «do you want us to change all our SPARQL implementations again?» Each of this is a valid argument to delay this change, but it seems unavoidable to make it eventually. No good technical reason appears to prevent us from doing this rather sooner than later – and abolish a two class society of query conditions for the sake of a first class query language.

enNote:The State of the UNION: Why no SPARQL Condition Should be Second Class


Personal tools