{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linking data on the web\n", "\n", "In this notebook you'll experience how to consume structured data available on Wikidata to extend and enrich data managed within Nexus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "This notebook assumes you've created a project within the AWS deployment of Blue Brain Nexus. If not follow the Blue Brain Nexus [Quick Start tutorial](https://bluebrain.github.io/nexus/docs/tutorial/getting-started/quick-start/index.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "You'll work through the following steps:\n", "\n", "1. Install and configure the Blue Brain Nexus python sdk\n", "2. Create a sparql wrapper around your project's SparqlView\n", "3. Create a sparql wrapper around the Wikidata Sparql Service\n", "4. Query more metadata from the Wikidata Sparql Service for movies stored in Nexus using their tmdbId\n", "5. Save the extended movies metadata back to Nexus and perform new queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Install and configure the Blue Brain Nexus python sdk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install git+https://github.com/BlueBrain/nexus-cli" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import nexussdk as nexus\n", "\n", "\n", "nexus_deployment = \"https://sandbox.bluebrainnexus.io/v1\"\n", "\n", "token = \"your token here\"\n", "nexus.config.set_environment(nexus_deployment)\n", "nexus.config.set_token(token)\n", "\n", "\n", "org =\"tutorialnexus\"\n", "\n", "#Provide your project name here.\n", "project =\"your_project_here\"\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create a sparql wrapper around your project's SparqlView" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every project in Blue Brain Nexus comes with a SparqlView enabling to navigate the data as a graph and to query it using the [W3C SPARQL Language](https://www.w3.org/TR/sparql11-query/). \n", "The address of such SparqlView is https://sandbox.bluebrainnexus.io/v1/views/tutorialnexus/\\$PROJECTLABEL/graph/sparql for a project withe label \\$PROJECTLABEL. The address of a SparqlView is also called a sparql endpoint." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install git+https://github.com/RDFLib/sparqlwrapper" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Utility functions to create sparql wrapper around a sparql endpoint\n", "\n", "from SPARQLWrapper import SPARQLWrapper, JSON, POST, GET, POSTDIRECTLY, CSV\n", "import requests\n", "\n", "\n", "\n", "def create_sparql_client(sparql_endpoint, http_query_method=POST, result_format= JSON, token=None):\n", " sparql_client = SPARQLWrapper(sparql_endpoint)\n", " #sparql_client.addCustomHttpHeader(\"Content-Type\", \"application/sparql-query\")\n", " if token:\n", " sparql_client.addCustomHttpHeader(\"Authorization\",\"Bearer {}\".format(token))\n", " sparql_client.setMethod(http_query_method)\n", " sparql_client.setReturnFormat(result_format)\n", " if http_query_method == POST:\n", " sparql_client.setRequestMethod(POSTDIRECTLY)\n", " \n", " return sparql_client\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Utility functions\n", "import pandas as pd\n", "\n", "pd.set_option('display.max_colwidth', -1)\n", "\n", "# Convert SPARQL results into a Pandas data frame\n", "def sparql2dataframe(json_sparql_results):\n", " cols = json_sparql_results['head']['vars']\n", " out = []\n", " for row in json_sparql_results['results']['bindings']:\n", " item = []\n", " for c in cols:\n", " item.append(row.get(c, {}).get('value'))\n", " out.append(item)\n", " return pd.DataFrame(out, columns=cols)\n", "\n", "# Send a query using a sparql wrapper \n", "def query_sparql(query, sparql_client):\n", " sparql_client.setQuery(query)\n", " \n", " result_object = sparql_client.query()\n", " if sparql_client.returnFormat == JSON:\n", " return result_object._convertJSON()\n", " return result_object.convert()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let create a sparql wrapper around the project sparql view\n", "sparqlview_endpoint = nexus_deployment+\"/views/\"+org+\"/\"+project+\"/graph/sparql\"\n", "sparqlview_wrapper = create_sparql_client(sparql_endpoint=sparqlview_endpoint, token=token,http_query_method= POST, result_format=JSON)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let test that the SparqlView wrapper works by running a simple Sparql query to get 5 movies along with their titles." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "five_movie_query = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select ?movie_nexus_Id ?movieId ?title ?genres ?imdbId ?tmdbId ?revision\n", " WHERE {\n", " \n", " ?movie_nexus_Id a vocab:Movie.\n", " ?movie_nexus_Id nxv:rev ?revision.\n", " ?movie_nexus_Id vocab:movieId ?movieId.\n", " ?movie_nexus_Id vocab:title ?title.\n", " ?movie_nexus_Id vocab:imdbId ?imdbId.\n", " ?movie_nexus_Id vocab:genres ?genres.\n", " OPTIONAL {\n", " ?movie_nexus_Id vocab:tmdbId ?tmdbId.\n", " }\n", "} LIMIT 5\n", "\"\"\" %(org, project)\n", "\n", "nexus_results = query_sparql(five_movie_query,sparqlview_wrapper)\n", "\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Create a Sparql wrapper around the Wikidata Sparql Service" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wikidata_sparql_endpoint = \"https://query.wikidata.org/sparql\"\n", "wikidata_sparql_wrapper = create_sparql_client(sparql_endpoint=wikidata_sparql_endpoint,http_query_method= GET, result_format=JSON)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let test that the wrapper works by running a query that fetch the logo url for a given movie tmdbId 862 (Toy Story).\n", "You can play the following query in the Wikidata playground [Try It](https://query.wikidata.org/#SELECT%20%2a%0A%20%20%20%20WHERE%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP4947%20%22862%22.%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP154%20%3Flogo.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP364%20%3Flang.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP577%20%3FreleaseDate.%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP291%20wdt%3AQ30.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%7D).\n", "\n", "In the query below:\n", "\n", "* [wdt:P4947](https://www.wikidata.org/wiki/Property:P4947) is the wikidata property for tmdbId\n", "* [wdt:P154](https://www.wikidata.org/wiki/Property:P154) is the wiki data property for logo image url\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# wdt:P4947 is the wikidata property for tmdbId\n", "\n", "movie_logo_query = \"\"\"\n", "SELECT *\n", "WHERE\n", "{\n", " ?movie wdt:P4947 \"%s\".\n", " OPTIONAL{\n", " ?movie wdt:P154 ?logo.\n", " }\n", "}\n", "\"\"\" % (862)\n", "wiki_results = query_sparql(movie_logo_query,wikidata_sparql_wrapper)\n", "wiki_df =sparql2dataframe(wiki_results)\n", "wiki_df.head()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let display the logo of the Toy Story movie. This part might take some time but you can skip it." 
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import SVG, display\n", "\n", "movie_logo_url = wiki_df.at[0,'logo']\n", "display(SVG(movie_logo_url))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Query more metadata from the Wikidata SPARQL service" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For every movie retrieved from the Nexus SparqlView, we will get:\n", "\n", "* its cast members or voice actors\n", "* its publication date in the United States of America, if any\n", "\n", "[Try it](https://query.wikidata.org/#SELECT%20%3FtmdbId%20%3Fmovie%20%3Flogo%20%3Fnativelanguage%20%3Fpublication_date%20%3Fcast%20%3FgivenName%20%3FfamilyName%0A%20%20%20%20WHERE%0A%20%20%20%20%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP4947%20%3FtmdbId.%0A%20%20%20%20%20%20%20%20FILTER%20%28%3FtmdbId%20%3D%20%22862%22%29.%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP154%20%3Flogo.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP725%7Cwdt%3AP161%20%3Fcast.%0A%20%20%20%20%20%20%20%20%20%20%3Fcast%20wdt%3AP735%2Fwdt%3AP1705%20%3FgivenName.%0A%20%20%20%20%20%20%20%20%20%20%3Fcast%20wdt%3AP734%2Fwdt%3AP1705%20%3FfamilyName.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20OPTIONAL%7B%0A%20%20%20%20%20%20%20%20%3Fmovie%20wdt%3AP364%2Fwdt%3AP1705%20%3Fnativelanguage.%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%0A%20%20%20%20%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%3Fmovie%20p%3AP577%20%3Fpublication_date_node.%0A%20%20%20%20%20%3Fpublication_date_node%20ps%3AP577%20%3Fpublication_date.%20%23%20publication%20date%20statement%0A%20%20%20%20%20%3Fpublication_date_node%20pq%3AP291%20wd%3AQ30.%20%20%20%20%20%20%20%20%20%20%20%20%23%20qualifier%20on%20the%20release%20date%0A%20%20%7D%0A%20%20%20%20%7D) for the movie Toy Story." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "\n", "def panda_merge(df, df2, on):\n", "    # Merge on the index, keeping only the columns of df2 that are not already in df\n", "    cols_to_use = df2.columns.difference(df.columns)\n", "    dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')\n", "    return dfNew\n", "\n", "def panda_concatenate(dfs):\n", "    df = pd.concat(dfs)\n", "    return df\n", "\n", "wiki_dataframes = []\n", "for index, row in nexus_df.iterrows():\n", "    tmdbId = row['tmdbId']\n", "    movie_logo_query = \"\"\"\n", "    SELECT ?tmdbId ?movie ?logo ?nativelanguage ?publication_date ?cast ?givenName ?familyName\n", "    WHERE\n", "    {\n", "        ?movie wdt:P4947 ?tmdbId.\n", "        FILTER (?tmdbId = \"%s\").\n", "        OPTIONAL{\n", "            ?movie wdt:P154 ?logo.\n", "        }\n", "        OPTIONAL{\n", "            ?movie wdt:P725|wdt:P161 ?cast.\n", "            ?cast wdt:P735/wdt:P1705 ?givenName.\n", "            ?cast wdt:P734/wdt:P1705 ?familyName.\n", "        }\n", "        OPTIONAL{\n", "            ?movie wdt:P364/wdt:P1705 ?nativelanguage.\n", "        }\n", "\n", "        OPTIONAL {\n", "            ?movie p:P577 ?publication_date_node.\n", "            ?publication_date_node ps:P577 ?publication_date. # publication date statement\n", "            ?publication_date_node pq:P291 wd:Q30.            # qualifier on the release date\n", "        }\n", "    }\n", "    \"\"\" % (tmdbId)\n", "\n", "    wiki_results = query_sparql(movie_logo_query, wikidata_sparql_wrapper)\n", "    wiki_df = sparql2dataframe(wiki_results)\n", "    print(\"\"\"Display metadata (from Wikidata) for %s\"\"\" % (nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'movie_nexus_Id'].iloc[0]))\n", "    display(wiki_df.head())\n", "    wiki_dataframes.append(wiki_df)\n", "\n", "\n", "# Let's concatenate all data frames retrieved from Wikidata\n", "result_wiki_dataframes = panda_concatenate(wiki_dataframes)\n", "#display(result_wiki_dataframes.head())\n", "\n", "\n", "merge_wiki_dataframes = panda_merge(result_wiki_dataframes, nexus_df, \"tmdbId\")\n", "#display(merge_wiki_dataframes.head())\n", "wiki_dataframes_tojson = (merge_wiki_dataframes.apply(lambda x: x.dropna(), axis=1).groupby(['tmdbId','movie','nativelanguage'], as_index=False)\n", "                          .apply(lambda x: x[['cast','givenName','familyName']].to_dict('records'))\n", "                          .reset_index()\n", "                          .rename(columns={0:'casting'})\n", "                          .to_json(orient='records'))\n", "\n", "\n", "updated_movies_json = json.loads(wiki_dataframes_tojson)\n"
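] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, the optional cell below prints how many movies were enriched and pretty-prints the first record, so you can inspect the `casting` structure before saving it back to Nexus. It is a minimal sketch and only assumes the `updated_movies_json` list built above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect the enriched records before saving them back to Nexus\n", "print(\"Number of enriched movies:\", len(updated_movies_json))\n", "if updated_movies_json:\n", "    # Pretty-print the first record to see the casting structure (truncated for readability)\n", "    print(json.dumps(updated_movies_json[0], indent=2)[:1000])"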
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Save the extended movies metadata back to Nexus" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We obtained a JSON array of enriched movies to be saved back to Nexus\n", "from urllib.parse import urlencode, quote_plus\n", "\n", "def update_in_nexus(row):\n", "    row[\"@type\"] = \"Movie\"\n", "    _id = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'movie_nexus_Id'].iloc[0]\n", "    rev = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'revision'].iloc[0]\n", "    row[\"title\"] = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'title'].iloc[0]\n", "    row[\"genres\"] = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'genres'].iloc[0]\n", "    row[\"movieId\"] = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'movieId'].iloc[0]\n", "    row[\"imdbId\"] = nexus_df.loc[nexus_df['tmdbId'] == row['tmdbId'], 'imdbId'].iloc[0]\n", "\n", "    # Fetch the current revision of the resource before updating it\n", "    data = nexus.resources.fetch(org_label=org, project_label=project, resource_id=_id, schema_id=\"_\")\n", "    current_revision = data[\"_rev\"]\n", "\n", "    url = nexus_deployment+\"/resources/\"+org+\"/\"+project+\"/\"+\"_/\"+quote_plus(_id)\n", "    row[\"_self\"] = url\n", "    nexus.resources.update(resource=row, rev=current_revision)\n", "\n", "for item in updated_movies_json:\n", "    update_in_nexus(item)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The data can now be listed with the casting metadata obtained from Wikidata.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List movies with their casting names separated by a comma.\n", "\n", "movie_acting_query = \"\"\"\n", "PREFIX vocab: <https://sandbox.bluebrainnexus.io/v1/vocabs/%s/%s/>\n", "PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>\n", "SELECT ?movieId ?title ?genres ?imdbId ?tmdbId (group_concat(DISTINCT ?castname;separator=\", \") as ?casting)\n", "WHERE {\n", "    ?movie_nexus_Id a vocab:Movie.\n", "    ?movie_nexus_Id vocab:casting ?cast.\n", "    ?cast vocab:givenName ?givenName.\n", "    ?cast vocab:familyName ?familyName.\n", "    BIND (CONCAT(?givenName, \" \", ?familyName) AS ?castname).\n", "\n", "    ?movie_nexus_Id vocab:movieId ?movieId.\n", "    ?movie_nexus_Id vocab:title ?title.\n", "    ?movie_nexus_Id vocab:imdbId ?imdbId.\n", "    ?movie_nexus_Id vocab:genres ?genres.\n", "    OPTIONAL {\n", "        ?movie_nexus_Id vocab:tmdbId ?tmdbId.\n", "    }\n", "}\n", "GROUP BY ?movieId ?title ?genres ?imdbId ?tmdbId\n",
?tmdbId\n", "LIMIT 100\n", "\"\"\" %(org,project)\n", "\n", "nexus_updated_results = query_sparql(movie_acting_query,sparqlview_wrapper)\n", "nexus_df_updated=sparql2dataframe(nexus_updated_results)\n", "nexus_df_updated.head(100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pick an actor and find movies within which they acted\n", "actor = updated_movies_json[0][\"casting\"][0]\n", "given_name = actor[\"givenName\"]\n", "family_name = actor[\"familyName\"]\n", "\n", "query = \"\"\"\n", "PREFIX vocab: \n", "PREFIX nxv: \n", "Select ?movieId ?title ?genres ?imdbId ?tmdbId\n", " WHERE {\n", " \n", " ?movie_nexus_Id a vocab:Movie.\n", " ?movie_nexus_Id vocab:casting ?cast.\n", " ?cast vocab:familyName \"%s\".\n", " ?cast vocab:givenName \"%s\".\n", " ?movie_nexus_Id vocab:movieId ?movieId.\n", " ?movie_nexus_Id vocab:title ?title.\n", " ?movie_nexus_Id vocab:imdbId ?imdbId.\n", " ?movie_nexus_Id vocab:genres ?genres.\n", " OPTIONAL {\n", " ?movie_nexus_Id vocab:tmdbId ?tmdbId.\n", " }\n", "} LIMIT 5\n", "\"\"\" % (org, project, family_name, given_name)\n", "\n", "nexus_results = query_sparql(query,sparqlview_wrapper)\n", "\n", "nexus_df =sparql2dataframe(nexus_results)\n", "nexus_df.head()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }