SPARQL Named Query

As your SPARQL queries grow, you may run into duplicated query parts or performance problems. Federated queries are subject to additional constraints. The SPARQL Named Query proposal allows the explicit reuse of sub-queries. This blog post describes the problem in more detail, shows how SPARQL Named Query can solve it, and explains how you can already try it.

Let’s look at the problem using an example: we want to find all movies whose narrative location or filming location is the capital of Bavaria. First, we search for the capital of Bavaria:

?city wdt:P31 wd:Q515; # city
  wdt:P1376 wd:Q980. # capital of Bavaria

Then we combine two sub-queries with a UNION to get the movies:

{
  ?movie wdt:P31 wd:Q11424; # film
    wdt:P840 ?city. # narrative location
} UNION {
  ?movie wdt:P31 wd:Q11424; # film
    wdt:P915 ?city. # filming location
}

And that’s our complete query:

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?city ?cityLabel ?movie ?movieLabel WHERE {
  ?city wdt:P31 wd:Q515; # city
    wdt:P1376 wd:Q980. # capital of Bavaria

  {
    ?movie wdt:P31 wd:Q11424; # film
      wdt:P840 ?city. # narrative location
  } UNION {
    ?movie wdt:P31 wd:Q11424; # film
      wdt:P915 ?city. # filming location
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100

Run the query on Wikidata

According to the SPARQL specification, the UNION sub-queries are evaluated first, and their result is then joined with the part that identifies the capital of Bavaria. A query optimizer may evaluate the query the other way round, which would speed up the UNION sub-queries, but you can’t rely on that. Another option is to place the city pattern explicitly inside each UNION sub-query:

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?city ?cityLabel ?movie ?movieLabel WHERE {
  {
    ?city wdt:P31 wd:Q515; # city
      wdt:P1376 wd:Q980. # capital of Bavaria

    ?movie wdt:P31 wd:Q11424; # film
      wdt:P840 ?city. # narrative location
  } UNION {
    ?city wdt:P31 wd:Q515; # city
      wdt:P1376 wd:Q980. # capital of Bavaria

    ?movie wdt:P31 wd:Q11424; # film
      wdt:P915 ?city. # filming location
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100

Run the query on Wikidata

SPARQL Named Query lets you define the query once and import the result into the UNION sub-queries with VALUES FROM. The final query would look like this:

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>

SELECT ?city ?cityLabel ?movie ?movieLabel WHERE {
  QUERY ?cityQuery {
    SELECT ?city WHERE {
      ?city wdt:P31 wd:Q515; # city
        wdt:P1376 wd:Q980. # capital of Bavaria
    }
  }

  {
    VALUES (?city) FROM ?cityQuery

    ?movie wdt:P31 wd:Q11424; # film
      wdt:P840 ?city. # narrative location
  } UNION {
    VALUES (?city) FROM ?cityQuery

    ?movie wdt:P31 wd:Q11424; # film
      wdt:P915 ?city. # filming location
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100
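
To make the effect of VALUES FROM more tangible, here is a minimal Python sketch of how such a clause could be rewritten into standard SPARQL 1.1 once the result of the named query is known. It is an illustration only, not the proposal's reference implementation: the function names `render_values` and `inline_named_query` are hypothetical, the regular expression does not account for comments or string literals, and the binding `wd:Q1726` is merely an assumed example result of the city query.

```python
import re


def render_values(variables, rows):
    """Render result rows as a standard SPARQL 1.1 inline VALUES block.

    Each row maps a variable name to an already serialized RDF term.
    """
    header = " ".join(f"?{name}" for name in variables)
    body = " ".join(
        "(" + " ".join(row[name] for name in variables) + ")" for row in rows
    )
    return f"VALUES ({header}) {{ {body} }}"


def inline_named_query(query, name, rows):
    """Replace every 'VALUES (...) FROM ?name' with an inline VALUES block."""
    pattern = re.compile(
        r"VALUES\s*\(([^)]*)\)\s*FROM\s*\?" + re.escape(name) + r"\b"
    )

    def substitute(match):
        # The variable list is taken from the VALUES FROM clause itself.
        variables = [v.lstrip("?") for v in match.group(1).split()]
        return render_values(variables, rows)

    return pattern.sub(substitute, query)


# Assumed example result of the ?cityQuery sub-query (hypothetical binding).
city_rows = [{"city": "wd:Q1726"}]
branch = "{ VALUES (?city) FROM ?cityQuery ?movie wdt:P840 ?city. }"
print(inline_named_query(branch, "cityQuery", city_rows))
```

Because the same rows are injected into every UNION branch, the city pattern is evaluated only once, yet each branch starts from the already reduced set of cities.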

You may say that you don’t need it because you only work with SPARQL endpoints that have a very well-tuned query optimizer. But with federated queries, the story gets more complicated. The specification mentions that “an implementation of a query planner for federated queries may decide to decompose the query into two queries instead”, but also “Many existing SPARQL endpoints have restrictions in the number of results they return and may miss the ones matching”. So depending on the behavior of the query planner, the result could differ. With SPARQL Named Query, it’s possible to enforce that the result set is reduced on the remote endpoint, which decreases the risk of wrong results caused by a result limit.
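
The effect of a result limit can be reproduced with a small simulation. The following Python sketch uses made-up data and a hypothetical `remote_endpoint` function that stands in for a remote SPARQL endpoint with a hard result limit; it does not talk to any real service. It compares a plan that fetches everything and joins locally with a plan that restricts the remote query up front, as an injected VALUES block would.

```python
def remote_endpoint(rows, limit, allowed=None):
    """Simulate a remote endpoint that truncates its result at `limit` rows.

    `allowed` plays the role of an injected VALUES block: when given,
    only rows whose pathway is in that set are matched remotely.
    """
    matched = [row for row in rows if allowed is None or row[0] in allowed]
    return matched[:limit]


# Made-up remote data: (pathway, annotation) pairs.
REMOTE_ROWS = [(f"pw{i}", f"tag{i}") for i in range(10)]
# Pathways selected on the local side (hypothetical).
LOCAL_PATHWAYS = {"pw7", "pw9"}
RESULT_LIMIT = 5

# Plan A: fetch without restriction, then join locally.
plan_a = [
    row for row in remote_endpoint(REMOTE_ROWS, RESULT_LIMIT)
    if row[0] in LOCAL_PATHWAYS
]

# Plan B: restrict the remote query before the limit applies.
plan_b = remote_endpoint(REMOTE_ROWS, RESULT_LIMIT, allowed=LOCAL_PATHWAYS)

print(plan_a)  # matching rows were cut off by the limit
print(plan_b)  # both matching rows survive
```

In this toy setup, plan A silently loses the matching rows because they lie beyond the limit, while plan B returns both of them.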

Here is an example that combines data from Wikidata and WikiPathways. Pathways whose label contains the string “vitamin” are identified on the Wikidata side. Annotations are then fetched from WikiPathways and joined:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT ?item ?pw_annotation ?annotation_label WHERE {
  ?item wdt:P2410 ?wpid;
    wdt:P2888 ?source_pathway;
    rdfs:label ?label.

  FILTER(CONTAINS(LCASE(?label), "vitamin"))

  SERVICE <http://sparql.wikipathways.org/sparql> {
    ?wp_pathway
      dc:identifier ?source_pathway;
      wp:ontologyTag ?pw_annotation.
    ?pw_annotation rdfs:label ?annotation_label.
  }
} LIMIT 100

Run the query on Wikidata

And the same query with SPARQL Named Query:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT ?item ?pw_annotation ?annotation_label WHERE {
  QUERY ?query {
    SELECT ?item ?source_pathway WHERE {
      ?item wdt:P2410 ?wpid;
        wdt:P2888 ?source_pathway;
        rdfs:label ?label.

      FILTER(CONTAINS(LCASE(?label), "vitamin"))
    }
  }

  SERVICE <http://sparql.wikipathways.org/sparql> {
    VALUES (?item ?source_pathway) FROM ?query

    ?wp_pathway dc:identifier ?source_pathway .
    ?wp_pathway wp:ontologyTag ?pw_annotation .
    ?pw_annotation rdfs:label ?annotation_label .
  }
} LIMIT 100

Run the query on the SPARQL Named Query Web application

One should be careful with performance comparisons on a public endpoint, but I got very consistent results:

  • standard query: ~5s
  • query with SPARQL Named Query: ~1.5s

I don’t have access to the intermediate query and result, but my guess is that the standard query takes longer because the remote endpoint has to process more results and hand them over to the local endpoint.

If you followed the links in the federated query example, you have already come across the Web application, which translates and processes the query on the client side. You can find the code in the sparql-named-query repository. It also includes a command line tool.
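
A client-side translation has to separate the named sub-queries from the rest of the query before it can run them. The Python sketch below shows one plausible way to do that with simple brace matching. It is not the code from the sparql-named-query repository: `extract_named_queries` is a hypothetical helper, and it assumes that braces do not occur inside comments or string literals.

```python
import re


def extract_named_queries(query):
    """Split a query into its named sub-queries and the remaining text.

    Returns a dict mapping each name to its sub-query text, plus the
    query with all 'QUERY ?name { ... }' blocks removed.
    """
    header = re.compile(r"\bQUERY\s+\?(\w+)\s*\{")
    named = {}
    rest = []
    pos = 0
    while True:
        match = header.search(query, pos)
        if match is None:
            rest.append(query[pos:])
            break
        rest.append(query[pos:match.start()])
        # Walk forward until the brace opened by the header is closed.
        depth = 1
        i = match.end()
        while depth > 0 and i < len(query):
            if query[i] == "{":
                depth += 1
            elif query[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError(f"unbalanced braces in named query ?{match.group(1)}")
        named[match.group(1)] = query[match.end():i - 1].strip()
        pos = i
    return named, "".join(rest)


example = (
    "SELECT ?city WHERE { "
    "QUERY ?cityQuery { SELECT ?city WHERE { ?city wdt:P1376 wd:Q980. } } "
    "VALUES (?city) FROM ?cityQuery }"
)
named, rest = extract_named_queries(example)
print(named)
print(rest)
```

Each extracted sub-query could then be executed on its own and its result substituted wherever the remaining query refers to it with VALUES FROM.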

This post and the code cover only SELECT queries. The concept could be extended to create graphs on the fly with CONSTRUCT and DESCRIBE queries. The result could then be accessed anywhere a Named Graph is addressed.
