{ "cells": [ { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-09-23T18:50:19.036357Z", "start_time": "2019-09-23T18:50:19.031896Z" } }, "source": [ "# Resolving\n", "\n", "A [Resolver](https://nexus-forge.readthedocs.io/en/latest/interaction.html#resolving) is used to link terms or a `Resource` to identifiers (URIs) in a knowledge graph, thus addressing lexical variations\n", "(merging of synonyms, aliases and acronyms) and disambiguating them. This feature is also referred to as entity linking,\n", "especially in the context of Natural Language Processing (NLP) when building a knowledge graph from entities extracted from\n", "text documents." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-09-23T18:50:20.068658Z", "start_time": "2019-09-23T18:50:19.054054Z" } }, "outputs": [], "source": [ "from kgforge.core import KnowledgeGraphForge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A configuration file is needed in order to create a KnowledgeGraphForge session. A configuration can be generated using the notebook [00-Initialization.ipynb](00%20-%20Initialization.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "forge = KnowledgeGraphForge(\"../../configurations/forge.yml\", debug=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from kgforge.core.commons.strategies import ResolvingStrategy\n", "from kgforge.core.resource import Resource" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discover configured resolvers\n", "With the `forge.resolvers()` method, configured resolvers can be inspected."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Available scopes:\n", " - entities :\n", " - resolver: DemoResolver\n", " - targets: agents\n", " - ontology :\n", " - resolver: DemoResolver\n", " - targets: cells\n", " - schemaorg :\n", " - resolver: EntityLinkerSkLearn\n", " - targets: terms\n", " - terms :\n", " - resolver: DemoResolver\n", " - targets: sexontology\n" ] } ], "source": [ "forge.resolvers() # The values are taken from \"../../configurations/forge.yml\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A scope is a convenient (and arbitrary) way to name a given Resolver along with a set of sources of data (the `targets`) to resolve against. Resolve a resource for `female` in the 'terms' resolving scope." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get resolvers as dictionary\n", "\n", "Passing `output=\"dict\"` as parameter in `forge.resolvers()` returns the resolvers as a dictionary of scopes and their\n", "respective targets." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'entities': {'agents': {'bucket': 'agents.json'}},\n", " 'ontology': {'cells': {'bucket': 'cell_types.json'}},\n", " 'schemaorg': {'terms': {'bucket': 'tfidfvectorizer_model_schemaorg_linking'}},\n", " 'terms': {'sexontology': {'bucket': 'sex.json'}}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resolvers = forge.resolvers(output=\"dict\")\n", "resolvers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DemoResolver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DemoResolver resolves a term by string comparison, looking it up in a JSON file."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### scope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resolve the text `female` against the 'terms' resolving scope." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "female = forge.resolve(text=\"female\", scope=\"terms\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kgforge.core.resource.Resource" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(female)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(female)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### use exact match" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "assert forge.resolve(text=\"feMAle\", scope=\"terms\", strategy=ResolvingStrategy.EXACT_MATCH) is None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### now exact but case-insensitive" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(forge.resolve(text=\"feMAle\", scope=\"terms\", strategy=ResolvingStrategy.EXACT_CASE_INSENSITIVE_MATCH))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### check it should be exact " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "assert forge.resolve(text=\"emale\", scope=\"terms\", strategy=ResolvingStrategy.EXACT_CASE_INSENSITIVE_MATCH) is None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### resolve
with best match" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(forge.resolve(text=\"emale\", scope=\"terms\", strategy=ResolvingStrategy.BEST_MATCH))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resolve the text `EPFL` against the 'entities' resolving scope." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "epfl = forge.resolve(\"EPFL\", scope=\"entities\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kgforge.core.resource.Resource" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(epfl)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://www.grid.ac/institutes/grid.5333.6\n", " type: Organization\n", " label: École Polytechnique Fédérale de Lausanne\n", " acronym: EPFL\n", "}\n" ] } ], "source": [ "print(epfl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### target" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"female\", scope=\"terms\", target=\"sexontology\"))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://www.grid.ac/institutes/grid.5333.6\n", " type: Organization\n", " label: École Polytechnique Fédérale de Lausanne\n", " acronym: EPFL\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"EPFL\", 
scope=\"entities\", target=\"agents\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### type" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"female\", scope=\"terms\", type=\"Class\"))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://www.grid.ac/institutes/grid.5333.6\n", " type: Organization\n", " label: École Polytechnique Fédérale de Lausanne\n", " acronym: EPFL\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"EPFL\", scope=\"entities\", type=\"Organization\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strategies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different strategies can be used to rank resolving candidates. \n", "\n", "In the following example, the missing 'e' at the end is intended for the demonstration." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "text = \"mal\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### best match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default applied strategy is `strategy=ResolvingStrategy.BEST_MATCH`." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000384\n", " type: Class\n", " label: male\n", "}\n" ] } ], "source": [ "print(forge.resolve(text, scope=\"terms\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### exact match" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(forge.resolve(text, scope=\"terms\", strategy=ResolvingStrategy.EXACT_MATCH))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### fuzzy match (all matches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The candidates list is ordered by score." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "results = forge.resolve(text, scope=\"terms\", strategy=ResolvingStrategy.ALL_MATCHES, limit=3)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(results)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(results)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kgforge.core.resource.Resource" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(results[0])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000384\n", " type: Class\n", " label: male\n", "}\n", "{\n", " id: 
http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(*results, sep=\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use case with cell types" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "pyramidal = 'Pyramidal Neuron'\n", "cell_characters = \"Lamp+\"\n", "hard_name = \"270_L5/6 NP CT CTX\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exact match" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://neuroshapes.org/PyramidalNeuron\n", " type: Class\n", " label: Pyramidal Neuron\n", "}\n" ] } ], "source": [ "print(forge.resolve(pyramidal, scope=\"ontology\", strategy=\"EXACT_MATCH\"))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/celltypes/Lamp_plus\n", " type: Class\n", " label: Lamp+\n", "}\n" ] } ], "source": [ "print(forge.resolve(cell_characters, scope=\"ontology\", strategy=\"EXACT_MATCH\"))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/ttypes/270_L5_6_NP_CT_CTX\n", " type: Class\n", " label: 270_L5/6 NP CT CTX\n", "}\n" ] } ], "source": [ "print(forge.resolve(hard_name, scope=\"ontology\", strategy=\"EXACT_MATCH\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When part of the text is lowercased, an exact match returns `None`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(forge.resolve(\"270_L5/6 np CT CTX\", scope=\"ontology\", strategy=\"EXACT_MATCH\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source":
[ "### Exact case-insensitive match" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/celltypes/Lamp_plus\n", " type: Class\n", " label: Lamp+\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"lamp+\", scope=\"ontology\", strategy=\"EXACT_CASE_INSENSITIVE_MATCH\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the case-insensitive match finds the cell type." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/ttypes/270_L5_6_NP_CT_CTX\n", " type: Class\n", " label: 270_L5/6 NP CT CTX\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"270_L5/6 np CT CTx\", scope=\"ontology\", strategy=\"EXACT_CASE_INSENSITIVE_MATCH\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Best match (default)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/ttypes/21_Sncg`\n", " type: Class\n", " label: 21_Sncg\n", "}\n" ] } ], "source": [ "print(forge.resolve(\"2\", scope=\"ontology\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### All matches" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: https://bbp.epfl.ch/ontologies/core/ttypes/21_Sncg`\n", " type:
Class\n", " label: 21_Sncg\n", "}\n", "{\n", " id: https://bbp.epfl.ch/ontologies/core/ttypes/270_L5_6_NP_CT_CTX\n", " type: Class\n", " label: 270_L5/6 NP CT CTX\n", "}\n" ] } ], "source": [ "results = forge.resolve(\"2\", scope=\"ontology\", strategy=\"ALL_MATCHES\")\n", "print(*results, sep=\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resolving a Resource\n", "A `kgforge.core.resource.Resource` can be resolved. In that case, in addition to the other supported arguments, the resource property to resolve can be provided through the `property_to_resolve` argument. The resolving result can be merged back into the input resource by setting the `merge_inplace_as` argument. When `merge_inplace_as` is not set, the results are returned as separate resources." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " type: Agent\n", " gender: mal\n", "}\n" ] } ], "source": [ "resource = Resource(type=\"Agent\", gender=\"mal\")\n", "print(resource)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "resource_resolved_merged = forge.resolve(resource, scope=\"terms\", target=\"sexontology\",\n", " strategy=ResolvingStrategy.ALL_MATCHES,\n", " property_to_resolve=\"gender\",\n", " merge_inplace_as=\"gender_resolved\",\n", " threshold=0.8)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "kgforge.core.resource.Resource" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(resource_resolved_merged)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " type: Agent\n", " gender: mal\n", " gender_resolved:\n", " [\n", " {\n", " id: http://purl.obolibrary.org/obo/PATO_0000384\n", " type: Class\n", " label: male\n", " }\n", " {\n",
" id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", " }\n", " ]\n", "}\n" ] } ], "source": [ "print(resource_resolved_merged)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "resource_resolved_separated = forge.resolve(resource, scope=\"terms\", target=\"sexontology\",\n", " strategy=ResolvingStrategy.ALL_MATCHES,\n", " property_to_resolve=\"gender\",\n", " threshold=0.8)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(resource_resolved_separated)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(resource_resolved_separated)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000384\n", " type: Class\n", " label: male\n", "}\n", "{\n", " id: http://purl.obolibrary.org/obo/PATO_0000383\n", " type: Class\n", " label: female\n", "}\n" ] } ], "source": [ "print(*resource_resolved_separated, sep=\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EntityLinkerSkLearn Resolver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on a pretrained model and using [scikit-learn](https://scikit-learn.org/stable/index.html) to generate and rank candidates." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/envs/kgforge/lib/python3.7/site-packages/sklearn/base.py:338: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.23.2 when using version 1.0.2. 
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations\n", " UserWarning,\n", "/opt/miniconda3/envs/kgforge/lib/python3.7/site-packages/sklearn/base.py:338: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations\n", " UserWarning,\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{\n", " id: http://schema.org/Person\n", " label: Person\n", " altLabel: Person\n", " definition: A person (alive, dead, undead, or fictional).\n", " score: 0.0\n", "}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/miniconda3/envs/kgforge/lib/python3.7/site-packages/sklearn/base.py:338: UserWarning: Trying to unpickle estimator NearestNeighbors from version 0.23.2 when using version 1.0.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations\n", " UserWarning,\n" ] } ], "source": [ "print(forge.resolve(\"person\", scope=\"schemaorg\", target=\"terms\", strategy=ResolvingStrategy.BEST_MATCH))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.7.13 ('kgforge')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" }, "vscode": { "interpreter": { "hash": "9ac393a5ddd595f2c78ea58b15bf8d269850a4413729cbea5c5fae9013762763" } } }, "nbformat": 4, "nbformat_minor": 4 }