{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create and Query a Nexus Elasticsearch view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this notebook is to learn how to connect to an Elasticsearch view and run queries against it. It is not a tutorial on the elasticsearch DSL language for which many well written [learning resources are available](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "This notebook assumes you've already created project. If not follow the Blue Brain Nexus [Quick Start tutorial](https://bluebrain.github.io/nexus/docs/tutorial/getting-started/quick-start/index.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "You'll work through the following steps:\n", "\n", "1. Create an Elasticsearch wrapper around your project's ElasticsearchView\n", "2. Explore and search data using the wrapper as well as the Elasticsearch DSL language\n", "\n", "\n", "This tutorial makes use of the [elasticsearch](https://github.com/elastic/elasticsearch-py) and [elasticsearch-dsl](https://github.com/elastic/elasticsearch-dsl-py) python clients allowing to connect to an elasticsearch search endpoint and perform various types of queries against it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Create an Elasticsearch wrapper around your project's ElasticsearchView" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every project in Blue Brain Nexus comes with an ElasticsearchView enabling to explore and search datausing the Elasticsearch DSL language. The address of such ElasticsearchView is \n", "`https://sandbox.bluebrainnexus.io/v1/views/tutorialnexus/$PROJECTLABEL/_search` for a project with label $PROJECTLABEL. The organiation 'tutorialnexus' is the one used throughout the tutorial but it can be replaced by any other organization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install [elasticsearch](https://github.com/elastic/elasticsearch-py) and [elasticsearch-dsl](https://github.com/elastic/elasticsearch-dsl-py)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install https://github.com/elastic/elasticsearch-py\n", "!pip install elasticsearch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install https://github.com/elastic/elasticsearch-dsl-py\n", "!pip install elasticsearch-dsl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set Nexus deployment configuration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import getpass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "token = getpass.getpass()" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [], "source": [ "deployment = \"https://nexus-sandbox.io/v1\"\n", "org_label = \"tutorialnexus\"\n", "project_label =\"myProject\"\n", "\n", "headers = {}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare a custom connection class for elasticsearch \n", "\n", "There is a need to extends the default Urllib3HttpConnection used by the elasticsearch client to:\n", " - include custom headers like an Authorization token\n", " - change the default search endpoint full_url construction method (perform_request):\n", " - full_url = self.host + self.url_prefix + \"/_all/_search\" +\"%s?%s\" % (url, urlencode(params))\n", " - self.url_prefix is the address of the provided ElasticsearchView\n", " - The new full_url construction method is:\n", " - full_url = self.host + self.url_prefix as Nexus hide the target index and handle through its API the search parameters like (from or size)\n", "\n", "Solution partially taken from: from https://github.com/elastic/elasticsearch-py/issues/407" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from elasticsearch import Elasticsearch, Urllib3HttpConnection\n", "from elasticsearch.serializer import JSONSerializer, DEFAULT_SERIALIZERS\n", "from elasticsearch_dsl import Search\n", "\n", "\n", "\n", "import time\n", "import ssl\n", "import urllib3\n", "from urllib3.exceptions import ReadTimeoutError, SSLError as UrllibSSLError\n", "from urllib3.util.retry import Retry\n", "import warnings\n", "import gzip\n", "from base64 import decodestring\n", "\n", "\n", "class MyConnection(Urllib3HttpConnection):\n", " def __init__(self,*args, **kwargs):\n", " extra_headers = kwargs.pop('extra_headers', {})\n", " super(MyConnection, self).__init__(*args, **kwargs)\n", " self.headers.update(extra_headers)\n", " \n", " def perform_request(\n", " self, method, url, params=None, body=None, timeout=None, ignore=(), headers=None\n", " ):\n", " \n", " #url = self.url_prefix +url\n", " url = self.url_prefix\n", " \n", " #if params:\n", " # url = \"%s?%s\" % (url, urlencode(params))\n", " full_url = self.host + self.url_prefix\n", "\n", " start = time.time()\n", " \n", " try:\n", " kw = {}\n", " if timeout:\n", " kw[\"timeout\"] = timeout\n", "\n", " # in python2 we need to make sure the url and method are not\n", " # unicode. Otherwise the body will be decoded into unicode too and\n", " # that will fail (#133, #201).\n", " if not isinstance(url, str):\n", " url = url.encode(\"utf-8\")\n", " if not isinstance(method, str):\n", " method = method.encode(\"utf-8\")\n", "\n", " request_headers = self.headers\n", " if headers:\n", " request_headers = request_headers.copy()\n", " request_headers.update(headers)\n", " if self.http_compress and body:\n", " try:\n", " body = gzip.compress(body)\n", " except AttributeError:\n", " # oops, Python2.7 doesn't have `gzip.compress` let's try\n", " # again\n", " body = gzip.zlib.compress(body)\n", "\n", " response = self.pool.urlopen(\n", " method, url, body, retries=Retry(False), headers=request_headers, **kw\n", " )\n", " duration = time.time() - start\n", " raw_data = response.data.decode(\"utf-8\")\n", " except Exception as e:\n", " self.log_request_fail(\n", " method, full_url, url, body, time.time() - start, exception=e\n", " )\n", " if isinstance(e, UrllibSSLError):\n", " raise SSLError(\"N/A\", str(e), e)\n", " if isinstance(e, ReadTimeoutError):\n", " raise ConnectionTimeout(\"TIMEOUT\", str(e), e)\n", " raise ConnectionError(\"N/A\", str(e), e)\n", "\n", " # raise errors based on http status codes, let the client handle those if needed\n", " if not (200 <= response.status < 300) and response.status not in ignore:\n", " self.log_request_fail(\n", " method, full_url, url, body, duration, response.status, raw_data\n", " )\n", " self._raise_error(response.status, raw_data)\n", "\n", " self.log_request_success(\n", " method, full_url, url, body, response.status, raw_data, duration\n", " )\n", "\n", " return response.status, response.getheaders(), raw_data\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create an elasticsearch wrapper" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "es_endpoint = deployment+\"/views/\"+org_label+\"/\"+project_label+\"/nxv%3AdefaultElasticSearchIndex/_search\"\n", "print(\"Elasticsearch View address: \" +es_endpoint)\n", "\n", "DEFAULT_SERIALIZERS[\"application/ld+json\"] = JSONSerializer()\n", "es_wrapper = Elasticsearch(es_endpoint, connection_class=MyConnection,send_get_body_as='POST', extra_headers={\"Authorization\":\"Bearer {}\".format(token)})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let test the elasticsearch wrapper by running a simple filter and aggregation queries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve non deprecated resources\n", "s = Search(using=es_wrapper) \\\n", " .filter(\"term\", _deprecated=\"false\")\n", "\n", "# Aggregate them by type\n", "s.aggs.bucket('per_type', 'terms', field='@type')\n", "\n", "response = s.execute()\n", "total = response.hits.total\n", "print('total hits', total.relation, total.value)\n", "\n", "# Don't forget that ressources are paginated with 10 as default: use from and size (on Nexus API) to get all the hits\n", "print(\"Displaying 10 first search results (score id type)\")\n", "for hit in response:\n", " print(hit.meta.score, hit['@id'], \"\" if \"@type\" not in hit else hit['@type'])\n", " \n", "print(\"Displaying aggregation\")\n", "for _type in response.aggregations.per_type.buckets:\n", " print(str(_type.doc_count)+ \" resources of type \"+_type.key)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create an ElasticSearchView\n", "\n", "The goal here is to illustrate hwo to use the Nexus SDK to create an Elasticsearch view. The full documentation can be found at: https://bluebrainnexus.io/docs/api/current/kg/kg-views-api.html#create-an-elasticsearchview-using-post.\n", "\n", "```\n", "{\n", " \"@id\": \"{someid}\",\n", " \"@type\": [ \"View\", \"ElasticSearchView\"],\n", " \"resourceSchemas\": [ \"{resourceSchema}\", ...],\n", " \"resourceTypes\": [ \"{resourceType}\", ...],\n", " \"resourceTag\": \"{tag}\",\n", " \"sourceAsText\": {sourceAsText},\n", " \"includeMetadata\": {includeMetadata},\n", " \"includeDeprecated\": {includeDeprecated},\n", " \"mapping\": _elasticsearch mapping_\n", "}\n", "```\n", "\n", "An ElasticSearchView is a way to tell Nexus:\n", "\n", "1. Which resources to index in the view: \n", "\n", " * resources that conform to a given schema: set resourceSchemas to the targeted schemas\n", "\n", " * resources that are of a given type: set resourceTypes to the targeted types\n", "\n", " * resources that are tagged: set resourceTag to the targeted tag value.\n", " \n", "2. Which mapping to use when indexing the selected resources:\n", "\n", " * set mappingto be used: [More info about Elasticsearch mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install nexus-sdk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nexussdk as nexus\n", "nexus.config.set_environment(deployment)\n", "nexus.config.set_token(token)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type_to_index = \"https://bluebrain.github.io/nexus/vocabulary/File\"\n", "\n", "view_data = {\n", " \"@id\": \"http://myView6.com\",\n", " \"@type\": [\n", " \"ElasticSearchView\"\n", " ],\n", " \"includeMetadata\": True,\n", " \"includeDeprecated\": False,\n", " \"resourceTypes\":type_to_index,\n", " \"mapping\": {\n", " \"dynamic\": \"false\",\n", " \"properties\": {\n", " \"@id\": {\n", " \"type\": \"keyword\"\n", " },\n", " \"@type\": {\n", " \"type\": \"keyword\"\n", " }\n", " }\n", " },\n", " \"sourceAsText\": False\n", "}\n", "\n", "try:\n", " response = nexus.views.create_es(org_label=org_label, project_label=project_label,view_data=view_data)\n", " print(dict(response))\n", "except nexus.HTTPError as ne:\n", " print(ne.response.json())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List views " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = nexus.views.list(org_label=org_label, project_label=project_label)\n", "print(dict(response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Views statistics \n", "\n", "Please refer to https://bluebrainnexus.io/docs/api/current/kg/kg-views-api.html#fetch-view-statistics for more details about how to access the view indexing progress." ] } ], "metadata": { "kernelspec": { "display_name": "Python (nexus-cli)", "language": "python", "name": "nexus-cli" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }