A look at SPARQL – SQL for the Semantic Web

SPARQL is the query language for Linked Data and the Semantic Web. It provides capabilities that you simply cannot get out of traditional SQL, and its power to unearth knowledge is amazing. With it, you can run a distributed, or federated, query across multiple databases in a single query statement. Because SPARQL endpoints may exist on the World Wide Web as well as within your corporate enterprise, your own data can be augmented and extended by graphs upon graphs of information, even where only a single link has been made. You can find and explore relationships in your company data, as well as world data, that you didn’t even know existed – all without the need for schema knowledge or some brain-numbing entity-relationship diagram postered on your wall. SPARQL is just one of a handful of tools that are transforming the current Web of documents into tomorrow’s “semantic” Web of linked data. In this post, I provide a very light introduction to SPARQL with a few simple queries you can cut your teeth on.

RDF and the surprisingly simple, but infinitely powerful triple

Linked Data is based on the RDF data model – a model in which simple assertions are made with statements called “triples”. A triple has a structure like a simple sentence we might have learned about in grade school (e.g. “Jack is a friend of Jill.”). It is composed of a subject, a predicate, and an object.

Believe it or not, that’s really all we need to describe anything and everything known to man; the known universe can be described with triples. OK, if you want to get philosophical about it, we’d need a few axiomatic concepts to begin with, but let’s not get philosophical. A triple is all we need to say something meaningful: “Jack is a friend of Jill.”

That’s it. Unless Jill’s pissed-off at Jack, that’s a fact; that’s knowledge. Only, in RDF, it doesn’t really look like that. In RDF, the parts of a triple are expressed with Uniform Resource Identifiers (URIs), which can be globally unique on the Web, so the subject, predicate, and object each appear as a full URI rather than a plain name.

To make things easier, in RDF we can also use prefixed names for URIs as long as we pre-define the prefix.

@prefix social: <http://example.org/models/social#>.

The object of a triple may be a URI that refers to a resource, but it can also be a literal value – a string, integer, boolean, etc.
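To make the pieces concrete, here is a minimal Python sketch of the “Jack is a friend of Jill” statement as a triple, with prefixed names expanded into full URIs. Only the `social:` prefix comes from the declaration above. The `people:` prefix, the `friendOf` and `age` property names, and the age value are hypothetical, invented purely for illustration.

```python
# Prefix table: "social" is taken from the post; "people" is a made-up example.
PREFIXES = {
    "social": "http://example.org/models/social#",
    "people": "http://example.org/people#",  # hypothetical namespace
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'social:friendOf' into a full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


# A triple is just (subject, predicate, object).
friend_triple = (
    expand("people:Jack"),
    expand("social:friendOf"),  # hypothetical property name
    expand("people:Jill"),
)

# The object may also be a literal value instead of a URI.
age_triple = (expand("people:Jack"), expand("social:age"), 12)  # hypothetical fact

print(friend_triple)
print(age_triple)
```

The point of the prefix table is only convenience: the stored triple always carries the full URIs.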

Schema-less RDF storage

RDF triples are stored in what’s commonly referred to as a triple-store. A triple-store is to SPARQL what a relational database is to SQL. But since all we need to express a triple are three URIs, or two URIs and a literal value, we have no need for a pre-defined schema in order to put new data in. Pause for a second and let the gravity of that statement sink in, please. And because we don’t need a pre-defined schema, we can store new relationships and new kinds of information about things even though we were unable to predict all the possible needs of our application when we first developed it. An application that uses a triple-store can therefore be continuously enhanced and extended with no need for database schema changes and with significantly fewer code changes and deployments.

Linked Data applications can significantly reduce application design, development, and integration costs.

Wikipedia has a pretty good list of vendor implementations for triple-store support. Among them, for example, are AllegroGraph, Mulgara, OpenLink Virtuoso, Sesame, Oracle, and IBM DB2 (NoSQL Graph).

So, if you’re familiar with SQL databases, you might think of all the triples in a triple-store as a flat list of three-column records. However, because a triple links one concept with another, and because all the triples can be linked to one another, what we really end up with is a graph, which we can enter at any point, and bounce around like a cockroach on a hot griddle.
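Here is a rough Python sketch of both ideas: the flat list of three-column records, and the graph that falls out of it. This is a toy in-memory list, not how a real triple-store is implemented, and every name in it (the people, the `worksFor`, `locatedIn`, and `ownsPet` relationships) is hypothetical. Plain strings stand in for URIs to keep it readable.

```python
from typing import Optional

# Hypothetical facts; plain strings stand in for full URIs.
triples = [
    ("Jack", "friendOf", "Jill"),
    ("Jill", "worksFor", "Acme"),
    ("Acme", "locatedIn", "Springfield"),
]

# No schema to migrate: a brand-new kind of relationship is just another row.
triples.append(("Jill", "ownsPet", "Rex"))


def match(s: Optional[str] = None, p: Optional[str] = None, o: Optional[str] = None):
    """Return triples matching the pattern; None acts like a query variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def reachable(start: str) -> set:
    """Follow links outward from one node, treating the triples as a graph."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for _, _, obj in match(s=node):
            if obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


print(match(s="Jill"))            # everything we know about Jill
print(sorted(reachable("Jack")))  # everything we can reach starting from Jack
```

Calling `match(s="Jill")` is the same shape of question as a SPARQL pattern with a fixed subject and variables for the predicate and object.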

It is important to note that just because a triple-store does not require a schema, it doesn’t mean that we shouldn’t have a general model for the information we’re storing. We’re still storing information about certain kinds of things, which have common properties and relationships. We don’t have to standardize on those classes and attributes, but if we do, our data is much easier to query and to integrate with data conforming to other standard models. The model for a particular domain of knowledge is called a vocabulary or an ontology. An ontology defines the classes of things you’re going to be storing information about as well as the properties they have and the relationships that can hold among them. You can create your own or you can use other, well-defined vocabularies available online. It’s always a best practice to use existing vocabularies when they suit your needs, rather than invent your own. You can also leverage existing vocabularies and extend them to better suit the specific needs of your organization.

SPARQL endpoints

The RDF graph within a triple-store is then exposed through a standard interface on the web, called a SPARQL endpoint. A SPARQL endpoint accepts queries and returns results via HTTP. This is a machine-friendly interface to a knowledge base, although there is typically also a user interface for querying the data. Generic endpoints can query any Web-accessible RDF data, while specific endpoints are wired to query particular datasets. So, let’s take a look now at a specific SPARQL endpoint that’s very popular.
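Since an endpoint is just HTTP, a program can talk to it with nothing more than a URL. The Python sketch below builds such a request using only the standard library; it does not send it, so it runs without a network connection. I am assuming the endpoint address `http://dbpedia.org/sparql` and a JSON results format here, and the `build_request` helper is my own name, not part of any library.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint address for illustration.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/IBM> ?p ?o . } LIMIT 5"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    # The query text travels URL-encoded in the "query" parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
print(request.get_method())
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out so the sketch stays self-contained.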

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia. Because it is exposed as a SPARQL endpoint, we can use it to ask interesting questions about the information in Wikipedia. So, let’s get our feet wet, shall we? Point your web browser here: http://dbpedia.org/snorql/, and submit the following query:

SELECT ?p ?o WHERE {
     <http://dbpedia.org/resource/IBM> ?p ?o .
} LIMIT 200


In this query, we are saying, “Give me all the predicates (?p) and all of the objects (?o) where the URI in the subject is http://dbpedia.org/resource/IBM.” Or, in other words, “Show me all the properties and values of your resource, IBM.” When you select ‘Browse’ for the results format, the results should be a long list of properties and values.
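If you ask an endpoint for JSON instead of a browsable page, each result row comes back as a set of variable bindings. Here is a small Python sketch that flattens such a response into plain rows. The sample text is hand-written by me in the standard SPARQL JSON results layout, so treat its two rows as illustrative rather than as real DBpedia output; `rows` is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL JSON results layout; values are illustrative.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["p", "o"]},
  "results": {"bindings": [
    {"p": {"type": "uri", "value": "http://www.w3.org/2000/01/rdf-schema#label"},
     "o": {"type": "literal", "xml:lang": "en", "value": "IBM"}},
    {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/numberOfEmployees"},
     "o": {"type": "literal", "value": "434246"}}
  ]}
}
"""


def rows(response_text: str):
    """Flatten each binding into a plain {variable: value} dictionary."""
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    return [
        # A variable may be unbound in a given row, so skip missing keys.
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


for row in rows(SAMPLE_RESPONSE):
    print(row["p"], "->", row["o"])
```

Each dictionary corresponds to one line of the property-and-value listing: `?p` is the predicate and `?o` is the object.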

An interesting property, to me, is dbpedia:ontology/numberOfEmployees. At the time of this writing, the value was 434,246 employees. That seems like a big number, but I’m not entirely sure. Before I say, “wow,” I want to compare it with some other companies that I think of as being ‘big’. I can easily just replace the subject in the query with another company, such as the shipping, freight, logistics and supply chain management company, UPS.

SELECT ?p ?o WHERE {
     <http://dbpedia.org/resource/United_Parcel_Service> ?p ?o .
} LIMIT 100


How did I know that the subject URI should be /United_Parcel_Service and not /UPS? I took a quick look at Wikipedia. The URL http://en.wikipedia.org/wiki/UPS leads to an informational page about a variety of uses of the acronym, not the company page. From that informational page, however, I found the link to the company, and there I could see how the topic, United_Parcel_Service, was identified in the URL (i.e. http://en.wikipedia.org/wiki/United_Parcel_Service). You can usually take the topic name from a Wikipedia URL and add it to the end of http://dbpedia.org/resource/ to create the DBpedia URI. [1]
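That rule of thumb is easy to write down as code. The Python sketch below applies it to an English Wikipedia article URL; `wikipedia_to_dbpedia` is a helper name I made up, and since the mapping only holds "usually", the result should be treated as a guess to verify rather than a guarantee.

```python
from urllib.parse import urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(wikipedia_url: str) -> str:
    """Map a Wikipedia article URL to the DBpedia URI it usually corresponds to."""
    path = urlparse(wikipedia_url).path
    if not path.startswith("/wiki/"):
        raise ValueError("not a Wikipedia article URL: " + wikipedia_url)
    # Everything after /wiki/ is the topic name.
    topic = path[len("/wiki/"):]
    return DBPEDIA_RESOURCE + topic


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/United_Parcel_Service"))
```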

Of course, I can look at the results and see that UPS has fewer employees than IBM – 398,300 total at the time of this blog post. I could also use a slightly different kind of query to ask the question, “Is IBM bigger than UPS in terms of the total number of employees?” For the next query, you must first switch the results format from ‘Browse’ to ‘as XML’ or ‘as XML + XSLT’.

PREFIX ont: <http://dbpedia.org/ontology/>
 
ASK
{
  <http://dbpedia.org/resource/IBM> ont:numberOfEmployees ?ibm .
  <http://dbpedia.org/resource/United_Parcel_Service> ont:numberOfEmployees ?ups .
  FILTER(?ibm > ?ups) .
}

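An ASK query returns a single yes-or-no answer rather than rows. In the XML results format, that answer arrives in a `boolean` element. Here is a minimal Python sketch for reading it; the sample document is hand-written by me to match the standard SPARQL XML results layout, and `ask_answer` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SPARQL XML results format.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hand-written sample response; a real endpoint would supply this text.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head/>
  <boolean>true</boolean>
</sparql>"""


def ask_answer(xml_text: str) -> bool:
    """Extract the yes/no answer from an ASK response in XML format."""
    root = ET.fromstring(xml_text)
    node = root.find("sr:boolean", NS)
    if node is None or node.text is None:
        raise ValueError("response has no <boolean> element")
    return node.text.strip() == "true"


print(ask_answer(SAMPLE))
```

With the employee counts quoted in this post (434,246 for IBM and 398,300 for UPS), the FILTER comparison holds, so `true` is the answer I would expect for those values.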

Here’s a more sophisticated query for all companies that have more than 398300 employees, filtered for English labels only, and ordered by number of employees descending.

PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 
SELECT ?company_name ?num_employees
WHERE {
    ?company a ont:Company;
             rdfs:label ?company_name ;
             ont:numberOfEmployees ?num_employees .
    FILTER (?num_employees > 398300 && lang(?company_name) = "en") .
} ORDER BY DESC(?num_employees) LIMIT 250


As you can see, not all of the data is clean, because Wikipedia editors have sometimes included the source date in the field where the number of employees was recorded. That’s an issue to be sorted out either with the DBpedia parsing engine or by Wikipedia editors. It’s also something to keep in mind about the World Wide Web of linked open data; while it affords us broad access to a lot of valuable information, there are still issues with accuracy, proof, and trust – not much different from those we’re familiar with in the current “Web of documents”. There is something fundamentally different about linking data than linking documents, however, and I hope that I have at least helped you get a sense of it.
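Until the source data is cleaned up, a consumer of the results can defend itself with a little parsing. The Python sketch below pulls the leading number out of a messy value. The sample strings are hypothetical stand-ins for the kind of entries described above (a count followed by a year in parentheses), not actual DBpedia output, and `parse_employee_count` is my own helper name.

```python
import re
from typing import Optional


def parse_employee_count(raw: str) -> Optional[int]:
    """Pull the leading number out of a messy value; return None if there is none."""
    found = re.match(r"\s*(\d[\d,]*)", raw)
    if found is None:
        return None
    # Drop thousands separators before converting.
    return int(found.group(1).replace(",", ""))


# Hypothetical examples of clean and messy field values.
samples = ["434246", "398,300", "434,246 (2012)", "unknown"]
print([parse_employee_count(s) for s in samples])
```

Returning `None` for unparseable values, rather than guessing, keeps bad rows visible so they can be reviewed instead of silently skewing a comparison.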

SPARQL is one of the key technologies of the “next” Web, and though it’s not as popular or as well understood as traditional SQL, it is mature and ready to deliver new value to the modern corporate enterprise.

Is that your enterprise? Maybe it should be.

Notes:

  1. Bob DuCharme has made this point with very similar words in his excellent O’Reilly book, Learning SPARQL, Second Ed. He wrote, “If Wikipedia has a page for ‘Some_Topic’ at http://en.wikipedia.org/wiki/Some_Topic, the DBpedia URI to represent that resource is usually http://dbpedia.org/resource/Some_Topic.”