Tuesday October 13, 2009

Improving the Conciseness of Turtle and SPARQL

RDF/XML, the "official" serialization format for RDF (Resource Description Framework), was never designed for use by humans. Turtle (Terse RDF Triple Language) is a great improvement, but it's still a bit verbose for my tastes. SPARQL, being largely modeled after Turtle, shares many of its limitations.

A Look at Turtle

Turtle is a DSL (domain-specific language) for RDF. Several features help to make it concise:

  • RDF-oriented syntax

    Unlike RDF/XML, Turtle is not a specialization (ie, dialect) of a general-purpose (aka "Digital Tupperware") format. So, many syntax elements and structural levels simply disappear.

  • @base and @prefix directives

    These directives provide a convenient, if limited, mechanism for shortening URIs (Uniform Resource Identifiers). So, many of the longest tokens are significantly shortened.

  • comma and semi-colon symbols

    The comma and semi-colon symbols reduce explicit repetition in triples. The semi-colon gets rid of repeated subjects; the comma gets rid of repeated subject/predicate pairs.
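
To make the comma and semi-colon rules concrete, here is a minimal Python sketch that groups plain triples by subject and predicate and writes them out in abbreviated form. The function name to_turtle and the class ex:Animal are made up for illustration; a real serializer would also handle prefixes, literals, and escaping.

```python
def to_turtle(triples):
    """Serialize (subject, predicate, object) tuples using ';' and ','."""
    grouped = {}
    for s, p, o in triples:
        # Same subject -> one block; same subject and predicate -> one object list.
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    blocks = []
    for s, preds in grouped.items():
        parts = [f"{p} {', '.join(objs)}" for p, objs in preds.items()]
        blocks.append(f"{s} " + " ;\n    ".join(parts) + " .")
    return "\n".join(blocks)

# ex:Animal is a hypothetical class, added only to show the comma rule.
triples = [
    ("ex:Canine", "rdf:type", "owl:Class"),
    ("ex:Canine", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Canine", "rdfs:subClassOf", "ex:Animal"),
]
print(to_turtle(triples))
```

Three input triples become one statement: the subject appears once, and the repeated subject/predicate pair collapses into a single object list.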

Unfortunately, the syntax can still be needlessly verbose. Consider this example code from Chapter 4 of Semantic Web Programming:
  @prefix    ex:              <>.
  ex:Mammal  rdf:type         owl:Class.
  ex:Canine  rdf:type         owl:Class;
             rdfs:subClassOf  ex:Mammal.
  ex:Human   rdf:type         owl:Class;
             rdfs:subClassOf  ex:Mammal.
Clearly, the "ex:" prefix and the semi-colon help, but why are we repeating so much information? Combining the comma symbol with a bit of OWL magic gives us:
  h:irs_sCO  owl:inverseOf    rdfs:subClassOf.
  h:ir_Type  owl:inverseOf    rdf:type.

  @prefix    ex:              <>.
  owl:Class  h:ir_Type        ex:Canine, ex:Human, ex:Mammal.
  ex:Mammal  h:irs_sCO        ex:Canine, ex:Human.
Given that the "h" (helper) predicates can be defined elsewhere, this gives quite a reduction in visible code size and apparent complexity. But even this is a bit redundant. Why do we need to say that "ex:Canine" is a class at all? Doesn't the "rdfs:subClassOf" predicate imply this?
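
The helper predicates only pay off if they can be turned back into ordinary triples. Here is a rough Python sketch of that expansion, assuming each helper is exactly the inverse of the predicate named in its definition (h:ir_Type for rdf:type, h:irs_sCO for rdfs:subClassOf). The INVERSES table and the expand function are my own illustration, not part of any existing tool; an OWL reasoner would reach the same triples via owl:inverseOf.

```python
# Helper predicate -> the predicate it is declared to be the inverse of.
INVERSES = {
    "h:ir_Type": "rdf:type",
    "h:irs_sCO": "rdfs:subClassOf",
}

def expand(triples):
    """Rewrite each helper triple (s, helper, o) as (o, original, s)."""
    out = []
    for s, p, o in triples:
        if p in INVERSES:
            out.append((o, INVERSES[p], s))  # swap subject and object
        else:
            out.append((s, p, o))            # ordinary triples pass through
    return out

compact = [
    ("owl:Class", "h:ir_Type", "ex:Canine"),
    ("ex:Mammal", "h:irs_sCO", "ex:Canine"),
]
for triple in expand(compact):
    print(triple)
```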

Another minor annoyance is the fact that @prefix definitions can't be used in defining other ones. So, we get in-line repetition of the form:

  @prefix    ex_foo:          <>.
  @prefix    ex_bar:          <>.
  @prefix    ex_baz:          <>.
The repetition of the shared URL is needlessly verbose. Worse, it violates the Don't Repeat Yourself (DRY) principle, formally stated as:
  "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system."
So, for example, if the URL needs to be changed, large numbers of lines may need editing...
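
One workaround is to generate the directives from a single definition. The sketch below is only an illustration: the base URL http://example.org/ is a placeholder, and the suffixes and the prefix_lines function are invented for the example.

```python
# Hypothetical base URL; in practice, whatever the prefixes have in common.
BASE = "http://example.org/"

# Hypothetical mapping from prefix name to the part that differs.
SUFFIXES = {"ex_foo": "foo#", "ex_bar": "bar#", "ex_baz": "baz#"}

def prefix_lines(base, suffixes):
    """Build one @prefix directive per entry, all sharing a single base."""
    return [f"@prefix {name}: <{base}{tail}> ." for name, tail in suffixes.items()]

print("\n".join(prefix_lines(BASE, SUFFIXES)))
```

With this arrangement, a change to the URL means editing one line (BASE) rather than every directive.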

Idioms, Patterns, etc.

Every programming language produces a collection of programming idioms and design patterns. However, it has been observed that many programming design patterns are simply workarounds for limitations of specific programming languages. In RDF, such patterns are seen in expressions of complex relationships (eg, "second cousin") and multi-way relationships (eg, "John drove his car to Boston on Thursday").

Imagine a DSL that could express such concepts simply and directly, with last-minute translation into RDF. Aside from easing the burden on humans, this could make the system less brittle, because the translations could be modified at any time. This general approach (eg, functions, macros, methods, templates) has worked well in other areas of computer engineering; it seems reasonable to look into it for RDF.

Expressing multi-way relationships (ie, N-ary predicates) is awkward in RDF, because they have to be mapped into sets of binary relationships. There are several languages which handle N-ary predicates, including Common Logic, Conceptual Graphs, and Object-Role Modeling. Perhaps one of these could be a starting point for a DSL.
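
To show what "mapped into sets of binary relationships" looks like in practice, here is a small Python sketch that turns the "John drove his car to Boston on Thursday" example into triples hanging off one intermediate node. Every name in it (ex:Drive, ex:driver, ex:vehicle, and so on) is hypothetical, as is the nary_to_triples function; this is one common way to do the mapping, not the only one.

```python
def nary_to_triples(node, relation_type, roles):
    """Map one N-ary relationship onto binary triples about a single node."""
    triples = [(node, "rdf:type", relation_type)]
    # One binary triple per participant in the relationship.
    triples += [(node, role, value) for role, value in roles.items()]
    return triples

# Hypothetical vocabulary for the driving example.
drive = nary_to_triples("_:drive1", "ex:Drive", {
    "ex:driver": "ex:John",
    "ex:vehicle": "ex:JohnsCar",
    "ex:destination": "ex:Boston",
    "ex:day": "ex:Thursday",
})
for triple in drive:
    print(triple)
```

A single sentence becomes five triples plus an extra node, which is exactly the sort of bookkeeping a DSL could hide.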

As long as we're asking for a pony, wouldn't it be nice to use the same DSL syntax in rules, statements, and queries? SPARQL and SPIN have some interesting notions for this sort of thing; let's see what we can borrow from them.

Constraints and Possible Solutions

The designers of RDF triplestores already have daunting challenges to handle. So, we need to leave the basic storage model of triplestores alone. However, we are free to use a DSL for editing, then process it into triples (eg, RDF/XML, Turtle) for loading, etc. Following this general approach, here is a sampling of possible techniques.

Macro and/or Template Processors

Macro processors (eg, cpp, m4) have been used for decades to solve problems of this sort. More recently, template processors (eg, eRuby) have found their way into use in code generation. Unfortunately, neither of these techniques yields flexible, attractive DSLs. So, current solutions tend to be based on dedicated translators or embedded (ie, language-based) DSLs.
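
For comparison, here is roughly what the template approach looks like, using Python's standard string.Template in place of cpp, m4, or eRuby. The template text and the render_subclasses function are illustrative only.

```python
from string import Template

# One template per recurring pattern: "a class that is a subclass of X".
CLASS_TEMPLATE = Template(
    "$name  rdf:type  owl:Class ;\n    rdfs:subClassOf  $parent ."
)

def render_subclasses(parent, names):
    """Fill in the template once for each subclass name."""
    return "\n".join(
        CLASS_TEMPLATE.substitute(name=name, parent=parent) for name in names
    )

print(render_subclasses("ex:Mammal", ["ex:Canine", "ex:Human"]))
```

It works, but the template is just text: nothing checks that the output is valid Turtle, and every new pattern needs a new template.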

Dedicated Translators

A dedicated translator can bring a great deal of processing power to the task. For example, it can use a specially-crafted parser, code generator, etc. This is a bit of a heavy-weight solution, however, so let's leave it as a last resort.

Language-based DSLs

Concise programming languages such as Ruby and Scala are commonly extended with language-based DSLs. Most instances of this generate code in the host language, but some do not. The Erector project, for example, generates HTML by means of some carefully-contrived Ruby classes.
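
As a rough idea of how a language-based DSL for RDF might look, here is a tiny Python sketch (Python rather than Ruby or Scala, purely for illustration). The Vocabulary class and its klass method are invented for this example; the point is only that the host language can collect triples while the author writes something shorter.

```python
class Vocabulary:
    """Collects triples while the caller describes classes in plain Python."""

    def __init__(self):
        self.triples = []

    def klass(self, name, *, sub_class_of=None):
        # Declaring a class always emits the rdf:type triple.
        self.triples.append((name, "rdf:type", "owl:Class"))
        if sub_class_of is not None:
            self.triples.append((name, "rdfs:subClassOf", sub_class_of))
        return name

v = Vocabulary()
mammal = v.klass("ex:Mammal")
v.klass("ex:Canine", sub_class_of=mammal)
v.klass("ex:Human", sub_class_of=mammal)

for triple in v.triples:
    print(triple)
```

The three klass calls produce the same five triples as the Turtle example from Semantic Web Programming, without the author typing rdf:type or owl:Class at all.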


Here are some resources that may be interesting and/or useful...
