Solving the Six Degrees of Kevin Bacon Problem 1

This article shows you how to solve the “Six Degrees of Kevin Bacon” game using a mixture of SPARQL and Python.

SPARQL

SPARQL is a query language for triple stores that was born out of the Semantic Web effort. A triple is a simple 3 part statement or ‘fact’ such as “Australia is Country”. This consists of two objects (‘Australia’ and ‘Country’) and the predicate (also known as a property), ‘is’.

There are a number of triple stores online. We will be using DBpedia which stores triples extracted from the Wikipedia pages. we will assume the actor listings are complete, although this is unlikely. There are a range of SPARQL applications in the wild, including Robotopedia which builds encyclopedia pages from the similar Wikidata database.

SPARQL’s syntax is inspired by SQL and consists of a list of variables to return, a WHERE clause, and some other optional clauses such as ‘LIMIT’. The WHERE clause consists of one or more triples to match. Objects or properties can be replaced with variables indicated by a ‘?’ character. These are the values to return.

For example, if querying the DBpedia database, the following SPARQL query searches for all films that starred Kevin Bacon:

Try it with the DBpedia SPARQL endpoint at: dbpedia.org/snorql/ . The ‘dbo:’ is a namespace abbreviation. SPARQL finds all triples in its database that have the specified predicate and second object, and returns all of the ?film objects from these triples.

We can extend the above query to match multiple triples. This works like an ‘AND’. For example, the following queries all films that Kevin Bacon starred in, and all actors he starred with on those films:

This can be extended with multiple steps (‘hops’) to additional films:

FILTER is a potentially powerful statement and can be used to filter the matches within specific constraints. Here we use them to avoid “loops” where a path goes through the same actor and/or film multiple times. However it does not guarantee the shortest routes. For example, the above query gives a number of one-hop results (e.g. via “R.I.P.D.”) for Jeff Bridges but then also gives the two-hop result of “The Air I Breathe” / Forest Whitaker / “Blown Away”.

Another problem is that both the query text and the search space explode very quickly. The above query can be expanded to include six films, and to only return the path from the actor of interest to Kevin Bacon, but this will probably lead to a timeout on the public server, and it would not report shorter paths. Ideally we want something that will return the shortest path.

Enter Python…

Although DBpedia’s SPARQL implementation does include some graph traversing extensions, these are difficult to use and non-standard. Our solution is to use Python to script the SPARQL queries. This allows us to add additional logic, limiting the queries that are performed.

The SPARQLWrapper library is used to make SPARQL calls, and can be installed with pip. Responses can be in XML or JSON. We shall use parsed JSON for convenience. Here is an example SPARQL call:

This queries DBPedia for all films starring Kevin Bacon, and all actors starring in those films. The parsed JSON result is displayed to the terminal.

From this, we can build up a new implementation. First some definitions, and then a new function query all films starring an actor,  and all actors starring in those films:

Notice the call to time.sleep(). This short delay is to throttle our queries. DBpedia is a public service and we should not flood it with a sudden influx of queries. You could remove this if you were working against your own local mirror of the DBpedia database.

To perform the search we use a very basic version of the Djikstra algorithm. This is used to find the optimum route through a graph network. The classic Djikstra algorithm has ‘weights’ (e.g. travel times) for the links and uses a priority queue to prioritize the next shortest path to work on. Here, all links (‘hops’) are of equal weight. Therefore we simply search through all 1-hop routes, then 2-hop routes, etc.

Our next method expands one path out to all films and actors that we have not yet visited:

Finally, we have the main loop:

When using this script from the command line, specify the actor as the first parameter.  The actor needs to be formatted as per the DBpedia label, e.g. “Gillian_Anderson” and without a namespace. The above script will add the namespace.

Results

Here are the results for John Mills and Charlie Chaplin:

 

Further Extensions

You could just as easily search out from Kevin Bacon to the target actor. In fact, you could do this without a specific target and end up with a list of all actors who can be reached within six steps!  The search space will explode for this many hops, so this is probably only practical if you are querying a local database.

One comment on “Solving the Six Degrees of Kevin Bacon Problem

  1. Reply Richard Marsden Jul 31,2019 8:41 pm

    The source code for the Python script has now been uploaded to github at:

    https://github.com/winwaed/kevin_bacon

Leave a Reply to Richard Marsden Cancel Reply