Implementing a 15x faster JavaScript SHACL Engine

One of my projects encountered a performance problem with the available SHACL engines. In addition, I needed to know which triples had been processed (coverage/fragments), so I implemented a new SHACL engine in JavaScript from scratch.

Performance issue

I collect a lot of data from my IoT devices at home. Everything is mapped into RDF and stored on a triplestore. Since sensors sometimes generate garbage data, I decided to use SHACL shapes to ensure data quality. Everything worked fine on my PC, but I noticed a performance issue on my Raspberry Pi, where I’m running some other services. I compared the performance of the rdf-validate-shacl JavaScript library with the pySHACL Python library and got similar results.

Profiling and analysis

After having a closer look at the rdf-validate-shacl library with a profiler, I found many calls to Dataset.match. As I went through the code, it was clear what was going on: Each time a shape was processed, all properties and values were fetched from the Dataset object. And that happens quite often for the same shape. The library is based on the reference implementation, and for this case, it makes sense to use the most straightforward solution to match the logic of the code with the specification. However, I wanted a production-ready library.

Implementation

Production-ready doesn’t mean readability will be ignored, and the shortest path to better performance will be used. I looked for a structure that fits best for the library and decided to implement an engine that creates shapes with a compile step.

First, it goes through all properties of a shape, finds the matching compile function, calls it, and adds the returned function to the shape. The compile function reads all required arguments from the dataset and, if required, preprocesses the data. It returns a new function that binds all the arguments and doesn’t require a shape context. Generic properties like message and severity are implemented as getter functions. The first time a getter is called, the value is fetched from the dataset; subsequent calls return a cached value.
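
The cached-getter pattern described above can be sketched like this. The names (Shape, ptr.out) are illustrative only; the actual shacl-engine internals differ.

```javascript
// Count dataset lookups to show the caching effect
let fetches = 0

// Stand-in for a pointer into the RDF dataset
const ptr = {
  out (predicate) {
    fetches++ // simulate an expensive dataset lookup
    return `value of ${predicate}`
  }
}

class Shape {
  constructor (ptr) {
    this.ptr = ptr
    this._message = null
  }

  get message () {
    if (this._message === null) {
      this._message = this.ptr.out('sh:message') // fetched only on first access
    }
    return this._message // cached afterwards
  }
}

const shape = new Shape(ptr)
shape.message
shape.message
console.log(fetches) // the dataset is hit only once
```

However often the shape is validated, the dataset is queried for the generic properties at most once.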

Below is the code for the sh:maxCount constraint. The compileMaxCount function reads the object value and converts it to an integer. The validateMaxCountProperty function binds the given maxCount argument and returns the validation function.

function compileMaxCount (shape) {
  const maxCount = parseInt(shape.ptr.out([ns.sh.maxCount]).value)

  return {
    property: validateMaxCountProperty(maxCount)
  }
}

function validateMaxCountProperty (maxCount) {
  return context => {
    context.test(context.values.terms.length <= maxCount, ns.sh.MaxCountConstraintComponent, {
      args: { maxCount },
      message: [context.factory.literal('More than {$maxCount} values')]
    })
  }
}

The validation code just needs to go through all compiled validation functions. Nested shapes are compiled on demand. This approach adds almost zero overhead for one-time validations but gives a significant performance boost if the validation of a shape is called multiple times.
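The loop described above can be sketched as follows. The names (validateShape, validations, the context object) are hypothetical, not the library's actual API; the point is that the validation step itself does no dataset lookups, because all arguments were bound at compile time.

```javascript
// A compiled shape is just a list of bound validation functions
function validateShape (shape, context) {
  for (const validate of shape.validations) {
    validate(context) // no dataset lookups here; arguments are already bound
  }
}

// Usage with a trivial "compiled" shape: a bound maxCount-style check
const results = []
const shape = {
  validations: [
    context => {
      if (context.values.length > 2) {
        results.push('MaxCount violation') // record the result
      }
    }
  ]
}

validateShape(shape, { values: ['a', 'b', 'c'] })
console.log(results)
```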

Benchmark

The following picture shows the runtime required to validate the SHACL Shapes to Validate Shapes Graphs from the specification against itself. Three different engines are included in this comparison:

  • pySHACL
  • rdf-validate-shacl
  • shacl-engine

shacl-engine was tested with different configurations:

  • +c: Coverage enabled.
  • +k: The validator instance has been constructed only once, and the instance has been reused for all loops.

Each configuration was run 100 times.

benchmark results

Surprisingly, reusing the validator instance took more time. I’m not sure why that is, but modern VMs are a complex topic.

engine                  mean    median  factor
pyshacl                 643.13  640.94  16.08
rdf-validate-shacl      632.39  629.47  15.79
rdf-validate-shacl +k   717.74  722.03  18.12
shacl-engine            40.28   39.85   1.00
shacl-engine +c         50.68   50.03   1.26
shacl-engine +c +k      59.92   59.84   1.50
shacl-engine +k         42.26   41.57   1.04

The table includes the factor compared to the plain shacl-engine run, which was the fastest of all configurations. As you can see, the performance of pySHACL and rdf-validate-shacl is very similar, and shacl-engine is about 15x faster.

Coverage/fragments

Another problem, which some of you may have already stumbled over, is that a failing SHACL target produces a report where everything looks fine. For this use case, it would be helpful to know which triples were matched by the shape. The same information could also be used to create sub-graphs based on shapes, as a Concise Bounded Description (CBD) alternative with more control.
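
As a rough illustration of the idea, one way to track coverage is to wrap the dataset's match method and record every quad it returns during validation. This is only a hedged sketch with a toy dataset; shacl-engine's actual implementation works differently.

```javascript
// Toy dataset: an array of triples with a naive match by subject
const dataset = {
  quads: [
    { s: 'ex:a', p: 'ex:p', o: '1' },
    { s: 'ex:b', p: 'ex:p', o: '2' }
  ],
  match (s) {
    return this.quads.filter(q => s === null || q.s === s)
  }
}

// Wrap the dataset so every matched quad is remembered
function withCoverage (dataset) {
  const covered = new Set()
  return {
    covered,
    match (...args) {
      const result = dataset.match(...args)
      for (const quad of result) covered.add(quad) // record touched triples
      return result
    }
  }
}

const tracked = withCoverage(dataset)
tracked.match('ex:a') // a validation run would call match many times
console.log(tracked.covered.size) // only the matched triple is recorded
```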

shacl-engine implements such a feature. The information about which triples were matched by a specific shape is stored in the details of the report. That can be very useful for debugging. There are many more details related to this feature worth talking about, like grapoi, the graph traversal library I wrote that keeps track of the traversed path, but that will be a topic for another blog post.

I got a hint on the SHACL Discord server that there is a fork of pySHACL that implements a similar feature. They call it shape fragments. The fork can be found here, and the authors also published a paper.

Playground

The playground application contains two text editors to copy and edit SHACL shapes and data in Turtle or JSON-LD format. The shapes and the report can be explored with a tree viewer. The coverage is shown in a text editor and visualized in a graph network component, which is quite helpful for debugging. It’s a quick and dirty application, and the UI leaves a lot of room for improvement, but its functionality has improved my process of creating complex shapes.

Thanks

I want to thank Holger Knublauch, who implemented the reference implementation, and Martin Maillard for porting the reference implementation to RDF/JS. It was good to have the other implementations to test edge cases.

Comments

For comments, please follow the GitHub link.