{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying Nexus knowledge graph using SPARQL\n", "\n", "The goal of this notebook is to learn the basics of SPARQL. Only the READ part of SPARQL will be exposed.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "This notebook assumes you've created a project within the AWS deployment of Nexus. If not follow the Blue Brain Nexus [Quick Start tutorial](https://bluebrain.github.io/nexus/docs/tutorial/getting-started/quick-start/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "You'll work through the following steps:\n", "\n", "1. Create a sparql wrapper around your project's SparqlView\n", "2. Explore and navigate data using the SPARQL query language\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Create a sparql wrapper around your project's SparqlView" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every project in Blue Brain Nexus comes with a SparqlView enabling to navigate the data as a graph and to query it using the W3C SPARQL Language. The address of such SparqlView is https://sandbox.bluebrainnexus.io/v1/views/tutorialnexus/\\$PROJECTLABEL/graph/sparql for a project withe label \\$PROJECTLABEL. The address of a SparqlView is also called a **SPARQL endpoint**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Configuration for the Nexus deployment\n", "nexus_deployment = \"https://sandbox.bluebrainnexus.io/v1\"\n", "\n", "token= \"your token here\"\n", "\n", "org =\"tutorialnexus\"\n", "project =\"$PROJECTLABEL\"\n", "\n", "headers = {}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let install sparqlwrapper which a python wrapper around sparql client\n", "!pip install git+https://github.com/RDFLib/sparqlwrapper" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Utility functions to create sparql wrapper around a sparql endpoint\n", "\n", "from SPARQLWrapper import SPARQLWrapper, JSON, POST, GET, POSTDIRECTLY, CSV\n", "import requests\n", "\n", "\n", "\n", "def create_sparql_client(sparql_endpoint, http_query_method=POST, result_format= JSON, token=None):\n", " sparql_client = SPARQLWrapper(sparql_endpoint)\n", " #sparql_client.addCustomHttpHeader(\"Content-Type\", \"application/sparql-query\")\n", " if token:\n", " sparql_client.addCustomHttpHeader(\"Authorization\",\"Bearer {}\".format(token))\n", " sparql_client.setMethod(http_query_method)\n", " sparql_client.setReturnFormat(result_format)\n", " if http_query_method == POST:\n", " sparql_client.setRequestMethod(POSTDIRECTLY)\n", " \n", " return sparql_client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Utility functions\n", "import pandas as pd\n", "\n", "pd.set_option('display.max_colwidth', -1)\n", "\n", "# Convert SPARQL results into a Pandas data frame\n", "def sparql2dataframe(json_sparql_results):\n", " cols = json_sparql_results['head']['vars']\n", " out = []\n", " for row in json_sparql_results['results']['bindings']:\n", " item = []\n", " for c in cols:\n", " item.append(row.get(c, {}).get('value'))\n", " out.append(item)\n", " return pd.DataFrame(out, columns=cols)\n", "\n", "# Send a query using a sparql wrapper \n", "def query_sparql(query, sparql_client):\n", " sparql_client.setQuery(query)\n", " \n", "\n", " result_object = sparql_client.query()\n", " if sparql_client.returnFormat == JSON:\n", " return result_object._convertJSON()\n", " return result_object.convert()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let create a sparql wrapper around the project sparql view\n", "sparqlview_endpoint = nexus_deployment+\"/views/\"+org+\"/\"+project+\"/graph/sparql\"\n", "sparqlview_wrapper = create_sparql_client(sparql_endpoint=sparqlview_endpoint, token=token,http_query_method= POST, result_format=JSON)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Explore and navigate data using the SPARQL query language\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let write our first query." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "select_all_query = \"\"\"\n", "SELECT ?s ?p ?o\n", "WHERE\n", "{\n", " ?s ?p ?o\n", "}\n", "OFFSET 0\n", "LIMIT 5\n", "\"\"\"\n", "\n", "nexus_results = query_sparql(select_all_query,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most SPARQL queries you'll see will have the anotomy above with:\n", "* a **SELECT** clause that let you select the variables you want to retrieve\n", "* a **WHERE** clause defining a set of constraints that the variables should satisfy to be retrieved\n", "* **LIMIT** and **OFFSET** clauses to enable pagination\n", "* the constraints are usually graph patterns in the form of **triple** (?s for subject, ?p for property and ?o for ?object)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiple triples can be provided as graph pattern to match but each triple should end with a period. As an example, let retrieve 5 movies (?movie) along with their titles (?title)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_with_title = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select ?movie ?title\n", " WHERE {\n", " ?movie a vocab:Movie.\n", " ?movie vocab:title ?title.\n", "} LIMIT 5\n", "\"\"\"%(org,project)\n", "\n", "nexus_results = query_sparql(movie_with_title,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note PREFIX clauses. It is way to shorten URIS within a SPARQL query. Without them we would have to use full URI for all properties.\n", "\n", "The ?movie variable is bound to a URI (the internal Nexus id). Let retrieve the movieId just like in the MovieLens csv files for simplicity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_with_title = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select ?movieId ?title\n", " WHERE {\n", " \n", " # Select movies\n", " ?movie a vocab:Movie.\n", "\n", " # Select their movieId value\n", " ?movie vocab:movieId ?movieId.\n", " \n", " #\n", " ?movie vocab:title ?title.\n", " \n", "} LIMIT 5\n", "\"\"\"%(org,project)\n", "\n", "nexus_results = query_sparql(movie_with_title,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above query movies are things (or entities) of type vocab:Movie. \n", "This is a typical instance query where entities are filtered by their type(s) and then some of their properties are retrieved (here ?title). \n", "\n", "Let retrieve everything that is linked (outgoing) to the movies. \n", "The * character in the SELECT clause indicates to retreve all variables: ?movie, ?p, ?o" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_with_properties = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select *\n", " WHERE {\n", " ?movie a vocab:Movie.\n", " ?movie ?p ?o.\n", "} LIMIT 20\n", "\"\"\"%(org,project)\n", "\n", "nexus_results = query_sparql(movie_with_properties,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a little exercise, write a query retrieving incoming entities to movies. You can copy past the query above and modify it.\n", "\n", "Hints: ?s ?p ?o can be read as: ?o is linked to ?s with an outgoing link.\n", "\n", "Do you have results ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Your query here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let retrieve the movie ratings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_with_properties = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select ?userId ?movieId ?rating ?timestamp\n", " WHERE {\n", " ?movie a vocab:Movie.\n", " ?movie vocab:movieId ?movieId.\n", " \n", " \n", " ?ratingNode vocab:movieId ?ratingmovieId.\n", " ?ratingNode vocab:rating ?rating.\n", " ?ratingNode vocab:userId ?userId.\n", " ?ratingNode vocab:timestamp ?timestamp.\n", " \n", " # Somehow pandas is movieId as double for rating \n", " FILTER(xsd:integer(?ratingmovieId) = ?movieId)\n", " \n", "} LIMIT 20\n", "\"\"\"%(org,project)\n", "\n", "nexus_results = query_sparql(movie_with_properties,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a little exercise, write a query retrieving the movie tags along with the user id and timestamp. You can copy and past the query above and modify it.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Your query here\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregate queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Aggregates](https://www.w3.org/TR/sparql11-query/#aggregates) apply some operations over a group of solutions.\n", "Available aggregates are: COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT, and SAMPLE.\n", "\n", "We will not see them all but we'll look at some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next query will compute the average rating score for 'funny' movies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tag_value = \"funny\"\n", "movie_avg_ratings = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "\n", "Select ( AVG(?ratingvalue) AS ?score)\n", " WHERE {\n", " # Select movies\n", " ?movie a vocab:Movie.\n", "\n", " # Select their movieId value\n", " ?movie vocab:movieId ?movieId.\n", "\n", " ?tag vocab:movieId ?movieId.\n", " ?tag vocab:tag ?tagvalue.\n", " FILTER(?tagvalue = \"%s\").\n", "\n", " # Keep movies with ratings\n", " ?rating vocab:movieId ?ratingmovidId.\n", " FILTER(xsd:integer(?ratingmovidId) = xsd:integer(?movieId))\n", " ?rating vocab:rating ?ratingvalue.\n", "\n", "}\n", "\"\"\" %(org,project,tag_value)\n", "\n", "nexus_results = query_sparql(movie_avg_ratings,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "display(nexus_df.head(20))\n", "nexus_df=nexus_df.astype(float)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve the number of tags per movie. Can be a little bit slow depending on the size of your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nbr_tags_per_movie = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "\n", "Select ?title (COUNT(?tagvalue) as ?tagnumber)\n", " WHERE {\n", " # Select movies\n", " ?movie a vocab:Movie.\n", " # Select their movieId value\n", " ?movie vocab:movieId ?movieId.\n", " \n", " ?tag a vocab:Tag.\n", " ?tag vocab:movieId ?tagmovieId.\n", " FILTER(?tagmovieId = ?movieId)\n", " ?movie vocab:title ?title.\n", " ?tag vocab:tag ?tagvalue.\n", "}\n", "\n", "GROUP BY ?title \n", "ORDER BY DESC(?tagnumber)\n", "LIMIT 10\n", "\"\"\" %(org,project)\n", "\n", "nexus_results = query_sparql(nbr_tags_per_movie,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "display(nexus_df.head(20))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Let plot the result\n", "nexus_df.tagnumber = pd.to_numeric(nexus_df.tagnumber)\n", "nexus_df.plot(x=\"title\",y=\"tagnumber\",kind=\"bar\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next query will retrieve movies along with users that tagged them separated by a comma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group Concat\n", "\n", "movie_tag_users = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "\n", "Select ?movieId (group_concat(DISTINCT ?userId;separator=\",\") as ?users)\n", " WHERE {\n", " # Select movies\n", " ?movie a vocab:Movie.\n", "\n", " # Select their movieId value\n", " ?movie vocab:movieId ?movieId.\n", "\n", " ?tag vocab:movieId ?movieId.\n", " ?tag vocab:userId ?userId.\n", "\n", " \n", "}\n", "GROUP BY ?movieId\n", "LIMIT 10\n", "\"\"\"%(org,project)\n", "\n", "nexus_results = query_sparql(movie_tag_users,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head(20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }