{ "cells": [ { "cell_type": "markdown", "id": "c4ee8736-2cae-487b-b722-78dbbe2f556d", "metadata": {}, "source": [ "# KGTK Browser Cache Setup" ] }, { "cell_type": "markdown", "id": "d8198f9d-7aad-417d-972b-b10514faf242", "metadata": {}, "source": [ "This note book will create the SQLite DB Cache and the required indices for KGTK Browser.\n", "\n", "The required input parameters are:\n", "- input_path: Path where the following files should be present\n", " 1. labels.en.tsv.gz\n", " 2. aliases.en.tsv.gz\n", " 3. descriptions.en.tsv.gz\n", " 4. claims.tsv.gz\n", " 5. metadata.property.datatypes.tsv.gz\n", " 6. qualifiers.tsv.gz\n", " 7. metadata.pagerank.undirected.tsv.gz\n", " 8. class-visualization.edge.tsv.gz **# optional, required for class visualization graph**\n", " 9. class-visualization.node.tsv.gz **# optional, required for class visualization graph**\n", " 10. derived.isastar.tsv.gz **# This file is required for creating the ES index only.**\n", "\n", "The files `metadata.pagerank.undirected.tsv.gz` and `derived.isastar.tsv.gz` are created by running [this](https://github.com/usc-isi-i2/kgtk/blob/dev/use-cases/Wikidata%20Useful%20Files.ipynb) notebook\n", "- output_path: Output path \n", "- project_name: folder inside the `output_path` where the required files and cache will be created\n", "- es_host: ES host, default `localhost`\n", "- es_port: ES port, default `9200`\n", "- es_index: name of the desired elasticsearch index, default `wikidata-dwd-kgtk-search-01`\n", "- create_db: variable to control creation of sqlite database cache. `yes|no`, default `yes`\n", "- create_es: variable to control creation of the ES index. `yes|no`, defaut `no`\n", "- create_class_viz: variable to control creation of class visualization tables in the sqlite database cache. Most probably, if you are not working with Wikidata, you will not have files # 8 and 9 listed above. Set this parameter to 'no' in that case. `yes|no`, default `yes`\n", "\n", "**Cache file location:** `//temp./wikidata.sqlite3.db`" ] }, { "cell_type": "code", "execution_count": 1, "id": "1a557f86-0d8a-4e89-9e76-4cedfafbf15a", "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "from kgtk.configure_kgtk_notebooks import ConfigureKGTK\n", "import kgtk.kypher.api as kapi\n", "from kgtk.functions import kgtk, kypher" ] }, { "cell_type": "code", "execution_count": 2, "id": "3442d8b8-4db0-40a9-a6aa-2db2e4d5b377", "metadata": {}, "outputs": [], "source": [ "input_path = \"/data/amandeep/wikidata-20211027-dwd-v3\"\n", "output_path = \"/data/amandeep/wikidata-20211027-dwd-v3\"\n", "\n", "project_name = \"kgtk-browser-files\"\n", "\n", "files = 'label,pagerank_undirected,alias,description,claims,datatypes,qualifiers'\n", "\n", "es_host = \"http://localhost\"\n", "es_port = 9200\n", "es_index = \"wikidata-dwd-kgtk-search-01\"\n", "\n", "create_db = 'yes'\n", "create_es = 'no'\n", "create_class_viz = 'yes'" ] }, { "cell_type": "code", "execution_count": 3, "id": "9d7e7730-53a9-4307-b448-c5d4e0946968", "metadata": {}, "outputs": [], "source": [ "files = files.split(',')\n", "\n", "create_sqlite_cache = create_db.lower() == 'yes'\n", "create_es_index = create_es.lower() == 'yes'\n", "create_class_viz_tables = create_class_viz.lower() == 'yes'\n", "\n", "additional_files = {\n", " 'classvizedge': 'class-visualization.edge.tsv.gz',\n", " 'classviznode': 'class-visualization.node.tsv.gz'\n", "}" ] }, { "cell_type": "code", "execution_count": 4, "id": "5d7adfbe-aa7e-46d2-ad76-e78f39434bda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User home: /nas/home/amandeep\n", "Current dir: /data/amandeep/github/kgtk-browser\n", "KGTK dir: /data/amandeep/github\n", "Use-cases dir: /data/amandeep/github/use-cases\n" ] } ], "source": [ "ck = ConfigureKGTK(files)\n", "if create_class_viz_tables:\n", " ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name,\n", " additional_files=additional_files)\n", "else:\n", " ck.configure_kgtk(input_graph_path=input_path,\n", " output_path=output_path,\n", " project_name=project_name)" ] }, { "cell_type": "code", "execution_count": 5, "id": "36bcea63-6cff-48fc-b06c-d0acc8346c80", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "USE_CASES_DIR: /data/amandeep/github/use-cases\n", "KGTK_OPTION_DEBUG: false\n", "kypher: kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db\n", "TEMP: /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files\n", "GRAPH: /data/amandeep/wikidata-20211027-dwd-v3\n", "kgtk: kgtk\n", "EXAMPLES_DIR: /data/amandeep/github/examples\n", "STORE: /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db\n", "KGTK_GRAPH_CACHE: /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db\n", "OUT: /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files\n", "KGTK_LABEL_FILE: /data/amandeep/wikidata-20211027-dwd-v3/labels.en.tsv.gz\n", "label: /data/amandeep/wikidata-20211027-dwd-v3/labels.en.tsv.gz\n", "pagerank_undirected: /data/amandeep/wikidata-20211027-dwd-v3/metadata.pagerank.undirected.tsv.gz\n", "alias: /data/amandeep/wikidata-20211027-dwd-v3/aliases.en.tsv.gz\n", "description: /data/amandeep/wikidata-20211027-dwd-v3/descriptions.en.tsv.gz\n", "claims: /data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz\n", "datatypes: /data/amandeep/wikidata-20211027-dwd-v3/metadata.property.datatypes.tsv.gz\n", "qualifiers: /data/amandeep/wikidata-20211027-dwd-v3/qualifiers.tsv.gz\n", "classvizedge: /data/amandeep/wikidata-20211027-dwd-v3/class-visualization.edge.tsv.gz\n", "classviznode: /data/amandeep/wikidata-20211027-dwd-v3/class-visualization.node.tsv.gz\n" ] } ], "source": [ "ck.print_env_variables()" ] }, { "cell_type": "markdown", "id": "99a8c8ef-3617-4f40-adfd-f2da816216c2", "metadata": {}, "source": [ "## Load the files into cache" ] }, { "cell_type": "code", "execution_count": 6, "id": "a7683f2e-49ee-4b4c-861e-587173380a6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "kgtk query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db -i \"/data/amandeep/wikidata-20211027-dwd-v3/labels.en.tsv.gz\" --as label -i \"/data/amandeep/wikidata-20211027-dwd-v3/metadata.pagerank.undirected.tsv.gz\" --as pagerank_undirected -i \"/data/amandeep/wikidata-20211027-dwd-v3/aliases.en.tsv.gz\" --as alias -i \"/data/amandeep/wikidata-20211027-dwd-v3/descriptions.en.tsv.gz\" --as description -i \"/data/amandeep/wikidata-20211027-dwd-v3/claims.tsv.gz\" --as claims -i \"/data/amandeep/wikidata-20211027-dwd-v3/metadata.property.datatypes.tsv.gz\" --as datatypes -i \"/data/amandeep/wikidata-20211027-dwd-v3/qualifiers.tsv.gz\" --as qualifiers -i \"/data/amandeep/wikidata-20211027-dwd-v3/class-visualization.edge.tsv.gz\" --as classvizedge -i \"/data/amandeep/wikidata-20211027-dwd-v3/class-visualization.node.tsv.gz\" --as classvizenode --limit 3\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-label-en\tP10\tlabel\t'video'@en\ten\t\t\n", "P1000-label-en\tP1000\tlabel\t'record held'@en\ten\t\t\n", "P10000-label-en\tP10000\tlabel\t'Research Vocabularies Australia ID'@en\ten\t\t\n" ] } ], "source": [ "if create_sqlite_cache:\n", " ck.load_files_into_cache()" ] }, { "cell_type": "markdown", "id": "bb65acc2-2a3d-4c8c-8e4f-d5167f9f84ae", "metadata": {}, "source": [ "## Define the Kypher API" ] }, { "cell_type": "code", "execution_count": 6, "id": "ba6552ee-2a6a-4893-b468-c1e15efd6885", "metadata": {}, "outputs": [], "source": [ "_kapi2 = kapi.KypherApi(graphcache=os.environ['STORE'], loglevel=1, index='auto',\n", " maxresults=100, maxcache=0)" ] }, { "cell_type": "markdown", "id": "97479a01-92b4-489c-91cc-e301907548ad", "metadata": {}, "source": [ "## Create a file with `label`, `undirected_pagerank` and `description`" ] }, { "cell_type": "code", "execution_count": 8, "id": "4fbcb99b-a4d9-4a24-aea3-93b0e7bbbc64", "metadata": {}, "outputs": [], "source": [ "if create_sqlite_cache:\n", " !kgtk query --gc $STORE \\\n", " -i label pagerank_undirected description\\\n", " --match 'label: (qnode)-[l]->(y), pagerank: (qnode)-[:Pundirected_pagerank]->(pr)' \\\n", " --opt 'description: (qnode)-[:description]->(d)' \\\n", " --return 'qnode as node1, l.label as label, y as node2, upper(y) as `node2;upper`, pr as `node1;pagerank`, ifnull(d, \"\") as `node1;description`' \\\n", " --order-by 'qnode' \\\n", " -o $OUT/label_pagerank_undirected_description.tsv.gz" ] }, { "cell_type": "markdown", "id": "d0ecdfcc-1ea4-4c83-8890-588349a6b57c", "metadata": {}, "source": [ "### Load this file into cache as well" ] }, { "cell_type": "code", "execution_count": 9, "id": "6ca3f1c3-91cd-4941-b6db-f6d0a887f35b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tnode2;upper\tnode1;pagerank\tnode1;description\n", "P10\tlabel\t'video'@en\t'VIDEO'@EN\t1.9177407615866818e-08\t'relevant video. For images, use the property P18. For film trailers, qualify with \\\"object has role\\\" (P3831)=\\\"trailer\\\" (Q622550)'@en\n", "P1000\tlabel\t'record held'@en\t'RECORD HELD'@EN\t1.4103331270065626e-08\t'notable record achieved by a person or entity, include qualifiers for dates held'@en\n", "P10000\tlabel\t'Research Vocabularies Australia ID'@en\t'RESEARCH VOCABULARIES AUSTRALIA ID'@EN\t2.586054821150431e-08\t'Identifier of a vocabulary in Research Vocabularies Australia'@en\n", "P10001\tlabel\t'Austrian Football Association player ID'@en\t'AUSTRIAN FOOTBALL ASSOCIATION PLAYER ID'@EN\t3.152198892346769e-08\t'identifier for an association football player on the Austrian Football Association website'@en\n", "P10002\tlabel\t'Dewan Negara ID'@en\t'DEWAN NEGARA ID'@EN\t2.3820406824266033e-08\t'numerical identifier for members or former members of the Dewan Negara of Malaysia on the official website'@en\n", "P10003\tlabel\t'Arachne.org.au ID'@en\t'ARACHNE.ORG.AU ID'@EN\t2.355944886245409e-08\t'numerical identifier for an arachnid taxon in the Arachne.org.au database'@en\n", "P10004\tlabel\t'Dewan Rakyat ID'@en\t'DEWAN RAKYAT ID'@EN\t2.3399165925456584e-08\t'numerical identifier for members or former members of the Dewan Rakyat of Malaysia on the official website'@en\n", "P10005\tlabel\t'Remontees-mecaniques.net ID'@en\t'REMONTEES-MECANIQUES.NET ID'@EN\t2.1853431037247363e-08\t'numerical identifier for a cableway on the French website Remontees-mecaniques.net'@en\n", "P10006\tlabel\t'AllSides ID'@en\t'ALLSIDES ID'@EN\t1.8609454548988348e-08\t'alphabetical identifier of a publication, organization, or person at AllSides'@en\n", "P10007\tlabel\t'Birdata ID'@en\t'BIRDATA ID'@EN\t2.1809764677402115e-08\t'identifier for an bird species in the Birdata database'@en\n" ] } ], "source": [ "if create_sqlite_cache:\n", " !kgtk query --gc $STORE -i $OUT/label_pagerank_undirected_description.tsv.gz --as l_d_pgr_ud --limit 10" ] }, { "cell_type": "markdown", "id": "49ba5559-1550-4c53-b5cd-7a7084164b61", "metadata": {}, "source": [ "## Create the required indices" ] }, { "cell_type": "code", "execution_count": 10, "id": "03aa07dd-5a8d-4451-acaf-41337978ce13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 16:57:25 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_10 AS graph_10_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 16:57:25 sqlstore]: CREATE INDEX \"graph_10_node1_idx\" ON \"graph_10\" (\"node1\")\n", "[2022-01-13 16:57:54 sqlstore]: ANALYZE \"graph_10_node1_idx\"\n", "[2022-01-13 16:57:58 sqlstore]: CREATE INDEX \"graph_10_node2;upper_idx\" ON \"graph_10\" (\"node2;upper\")\n", "[2022-01-13 16:59:15 sqlstore]: ANALYZE \"graph_10_node2;upper_idx\"\n", "[2022-01-13 16:59:21 sqlstore]: CREATE INDEX \"graph_10_label_idx\" ON \"graph_10\" (\"label\")\n", "[2022-01-13 16:59:45 sqlstore]: ANALYZE \"graph_10_label_idx\"\n", "[2022-01-13 16:59:49 sqlstore]: CREATE VIRTUAL TABLE \"graph_10_txtidx_ldpgridx\" USING FTS5 (\"node2\", tokenize=\"trigram\", content=\"graph_10\")\n", "[2022-01-13 16:59:49 sqlstore]: INSERT INTO \"graph_10_txtidx_ldpgridx\" (\"node2\") SELECT \"node2\" FROM graph_10\n", "node1\tlabel\tnode2\tnode2;upper\tnode1;pagerank\tnode1;description\n", "P10\tlabel\t'video'@en\t'VIDEO'@EN\t1.9177407615866818e-08\t'relevant video. For images, use the property P18. For film trailers, qualify with \\\"object has role\\\" (P3831)=\\\"trailer\\\" (Q622550)'@en\n", "P1000\tlabel\t'record held'@en\t'RECORD HELD'@EN\t1.4103331270065626e-08\t'notable record achieved by a person or entity, include qualifiers for dates held'@en\n", "P10000\tlabel\t'Research Vocabularies Australia ID'@en\t'RESEARCH VOCABULARIES AUSTRALIA ID'@EN\t2.586054821150431e-08\t'Identifier of a vocabulary in Research Vocabularies Australia'@en\n", "P10001\tlabel\t'Austrian Football Association player ID'@en\t'AUSTRIAN FOOTBALL ASSOCIATION PLAYER ID'@EN\t3.152198892346769e-08\t'identifier for an association football player on the Austrian Football Association website'@en\n", "P10002\tlabel\t'Dewan Negara ID'@en\t'DEWAN NEGARA ID'@EN\t2.3820406824266033e-08\t'numerical identifier for members or former members of the Dewan Negara of Malaysia on the official website'@en\n", "CPU times: user 12.3 s, sys: 2.48 s, total: 14.8 s\n", "Wall time: 11min 51s\n" ] } ], "source": [ "%%time \n", "if create_sqlite_cache:\n", " !kgtk --debug query -i l_d_pgr_ud --idx node1 \"node2;upper\" label text:node2//name=ldpgridx --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 11, "id": "e33e4d9d-e0da-4d29-8abf-5bada3316b88", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:09:18 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_1 AS graph_1_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:09:18 sqlstore]: CREATE INDEX \"graph_1_label_idx\" ON \"graph_1\" (\"label\")\n", "[2022-01-13 17:09:41 sqlstore]: ANALYZE \"graph_1_label_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-label-en\tP10\tlabel\t'video'@en\ten\t\t\n", "P1000-label-en\tP1000\tlabel\t'record held'@en\ten\t\t\n", "P10000-label-en\tP10000\tlabel\t'Research Vocabularies Australia ID'@en\ten\t\t\n", "P10001-label-en\tP10001\tlabel\t'Austrian Football Association player ID'@en\ten\t\t\n", "P10002-label-en\tP10002\tlabel\t'Dewan Negara ID'@en\ten\t\t\n", "CPU times: user 665 ms, sys: 138 ms, total: 803 ms\n", "Wall time: 37 s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i label --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 12, "id": "85f0a37f-0a55-4187-99d6-e251b7f095cf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:09:51 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_3 AS graph_3_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:09:51 sqlstore]: CREATE INDEX \"graph_3_label_idx\" ON \"graph_3\" (\"label\")\n", "[2022-01-13 17:10:00 sqlstore]: ANALYZE \"graph_3_label_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-alias-en-282226-0\tP10\talias\t'gif'@en\ten\t\t\n", "P10-alias-en-2f86d8-0\tP10\talias\t'animation'@en\ten\t\t\n", "P10-alias-en-c1427e-0\tP10\talias\t'media'@en\ten\t\t\n", "P10-alias-en-c61ab1-0\tP10\talias\t'trailer (Commons)'@en\ten\t\t\n", "P10000-alias-en-0df7f5-0\tP10000\talias\t'Australian Research Vocabularies ID'@en\ten\t\t\n", "CPU times: user 294 ms, sys: 62.9 ms, total: 357 ms\n", "Wall time: 16.3 s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i alias --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 13, "id": "cd728937-6ffd-406f-988a-6dc7c38cb6ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:10:07 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_4 AS graph_4_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:10:07 sqlstore]: CREATE INDEX \"graph_4_id_idx\" ON \"graph_4\" (\"id\")\n", "[2022-01-13 17:10:41 sqlstore]: ANALYZE \"graph_4_id_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-description-en\tP10\tdescription\t'relevant video. For images, use the property P18. For film trailers, qualify with \\\"object has role\\\" (P3831)=\\\"trailer\\\" (Q622550)'@en\ten\t\t\n", "P1000-description-en\tP1000\tdescription\t'notable record achieved by a person or entity, include qualifiers for dates held'@en\ten\t\t\n", "P10000-description-en\tP10000\tdescription\t'Identifier of a vocabulary in Research Vocabularies Australia'@en\ten\t\t\n", "P10001-description-en\tP10001\tdescription\t'identifier for an association football player on the Austrian Football Association website'@en\ten\t\t\n", "P10002-description-en\tP10002\tdescription\t'numerical identifier for members or former members of the Dewan Negara of Malaysia on the official website'@en\ten\t\t\n", "CPU times: user 746 ms, sys: 196 ms, total: 942 ms\n", "Wall time: 46.4 s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i description --idx id --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 14, "id": "c67f7305-8008-4ac0-bd0c-dd76186a8720", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:10:55 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_5 AS graph_5_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:10:55 sqlstore]: CREATE INDEX \"graph_5_label_idx\" ON \"graph_5\" (\"label\")\n", "[2022-01-13 17:25:12 sqlstore]: ANALYZE \"graph_5_label_idx\"\n", "[2022-01-13 17:25:48 sqlstore]: CREATE INDEX \"graph_5_node1_idx\" ON \"graph_5\" (\"node1\")\n", "[2022-01-13 17:31:45 sqlstore]: ANALYZE \"graph_5_node1_idx\"\n", "[2022-01-13 17:32:25 sqlstore]: CREATE INDEX \"graph_5_node2_idx\" ON \"graph_5\" (\"node2\")\n", "[2022-01-13 17:46:35 sqlstore]: ANALYZE \"graph_5_node2_idx\"\n", "[2022-01-13 17:47:20 sqlstore]: CREATE INDEX \"graph_5_id_idx\" ON \"graph_5\" (\"id\")\n", "[2022-01-13 17:56:45 sqlstore]: ANALYZE \"graph_5_id_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-P1628-32b85d-7927ece6-0\tP10\tP1628\t\"http://www.w3.org/2006/vcard/ns#Video\"\t\tnormal\turl\n", "P10-P1628-acf60d-b8950832-0\tP10\tP1628\t\"https://schema.org/video\"\t\tnormal\turl\n", "P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\t\tnormal\twikibase-item\n", "P10-P1630-53947a-fbe9093e-0\tP10\tP1630\t\"https://commons.wikimedia.org/wiki/File:$1\"\t\tnormal\tstring\n", "P10-P1659-P1651-c4068028-0\tP10\tP1659\tP1651\t\tnormal\twikibase-property\n", "CPU times: user 50.2 s, sys: 9.39 s, total: 59.6 s\n", "Wall time: 46min 58s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i claims --idx label node1 node2 id --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 15, "id": "726beb5e-05b0-45ca-b56f-738a147f5536", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:58:10 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_6 AS graph_6_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:58:10 sqlstore]: CREATE INDEX \"graph_6_label_idx\" ON \"graph_6\" (\"label\")\n", "[2022-01-13 17:58:10 sqlstore]: ANALYZE \"graph_6_label_idx\"\n", "[2022-01-13 17:58:10 sqlstore]: CREATE INDEX \"graph_6_node1_idx\" ON \"graph_6\" (\"node1\")\n", "[2022-01-13 17:58:10 sqlstore]: ANALYZE \"graph_6_node1_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-datatype\tP10\tdatatype\tcommonsMedia\t\t\t\n", "P1000-datatype\tP1000\tdatatype\twikibase-item\t\t\t\n", "P10000-datatype\tP10000\tdatatype\texternal-id\t\t\t\n", "P10001-datatype\tP10001\tdatatype\texternal-id\t\t\t\n", "P10002-datatype\tP10002\tdatatype\texternal-id\t\t\t\n", "CPU times: user 474 ms, sys: 86.5 ms, total: 561 ms\n", "Wall time: 24.3 s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i datatypes --idx label node1 --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 16, "id": "a37dc7b4-16aa-468a-9de6-f58064bcfca3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 17:58:16 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_7 AS graph_7_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 17:58:16 sqlstore]: CREATE INDEX \"graph_7_node2_idx\" ON \"graph_7\" (\"node2\")\n", "[2022-01-13 18:03:02 sqlstore]: ANALYZE \"graph_7_node2_idx\"\n", "[2022-01-13 18:03:13 sqlstore]: CREATE INDEX \"graph_7_node1_idx\" ON \"graph_7\" (\"node1\")\n", "[2022-01-13 18:05:21 sqlstore]: ANALYZE \"graph_7_node1_idx\"\n", "[2022-01-13 18:05:36 sqlstore]: CREATE INDEX \"graph_7_label_idx\" ON \"graph_7\" (\"label\")\n", "[2022-01-13 18:07:25 sqlstore]: ANALYZE \"graph_7_label_idx\"\n", "id\tnode1\tlabel\tnode2\tlang\trank\tnode2;wikidatatype\n", "P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0\tP10-P1855-Q15075950-7eff6d65-0\tP10\t\"Smoorverliefd 12 september.webm\"\t\t\tcommonsMedia\n", "P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0\tP10-P1855-Q15075950-7eff6d65-0\tP3831\tQ622550\t\t\twikibase-item\n", "P10-P1855-Q4504-a69d2c73-0-P10-bef003-0\tP10-P1855-Q4504-a69d2c73-0\tP10\t\"Komodo dragons video.ogv\"\t\t\tcommonsMedia\n", "P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0\tP10-P1855-Q69063653-c8cdb04c-0\tP10\t\"Couch Commander.webm\"\t\t\tcommonsMedia\n", "P10-P1855-Q7378-555592a4-0-P10-8a982d-0\tP10-P1855-Q7378-555592a4-0\tP10\t\"Elephants Dream (2006).webm\"\t\t\tcommonsMedia\n", "CPU times: user 10.2 s, sys: 1.85 s, total: 12.1 s\n", "Wall time: 9min 24s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache:\n", " !kgtk --debug query -i qualifiers --idx node2 node1 label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 17, "id": "74875c90-0028-4229-b01c-1a2892f01b56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 18:07:44 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_8 AS graph_8_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 18:07:44 sqlstore]: CREATE INDEX \"graph_8_label_idx\" ON \"graph_8\" (\"label\")\n", "[2022-01-13 18:09:31 sqlstore]: ANALYZE \"graph_8_label_idx\"\n", "node1\tlabel\tnode2\tgraph\tedge_type\n", "Q10267817\tP279\tQ18553442\tQ1225194\tsubclass\n", "Q107715\tP279\tQ309314\tQ246672\tsubclass\n", "Q107715\tP279\tQ309314\tQ937228\tsubclass\n", "Q107715\tP279\tQ309314\tQ7184903\tsubclass\n", "Q107715\tP279\tQ309314\tQ35120\tsubclass\n", "CPU times: user 2.18 s, sys: 463 ms, total: 2.64 s\n", "Wall time: 2min 5s\n" ] } ], "source": [ "%%time\n", "if create_sqlite_cache and create_class_viz_tables:\n", " !kgtk --debug query -i classvizedge --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": null, "id": "29b64553-be21-4b19-8f76-a1016abea27b", "metadata": {}, "outputs": [], "source": [ "%%time\n", "if create_sqlite_cache and create_class_viz_tables:\n", " !kgtk --debug query -i classvizedge --idx index:graph --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": null, "id": "bae7f7f1-fe72-468f-b1af-54c69e2da775", "metadata": {}, "outputs": [], "source": [ "%%time\n", "if create_sqlite_cache and create_class_viz_tables:\n", " !kgtk --debug query -i classviznode --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": null, "id": "cd4c1213-cee9-457e-9b89-4772a80c4697", "metadata": {}, "outputs": [], "source": [ "%%time\n", "if create_sqlite_cache and create_class_viz_tables:\n", " !kgtk --debug query -i classviznode --idx index:graph --gc $STORE --limit 5" ] }, { "cell_type": "markdown", "id": "8f9680b8-0248-4ef2-ae50-4613e51e0a48", "metadata": {}, "source": [ "## Take a look at cache file content" ] }, { "cell_type": "code", "execution_count": 21, "id": "b2090fe2-e018-4edb-a50f-49326479e1c2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Graph Cache:\n", "DB file: /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db\n", " size: 193.02 GB \tfree: 0 Bytes \tmodified: 2022-01-13 18:27:42\n", "\n", "KGTK File Information:\n", "alias:\n", " size: 156.76 MB \tmodified: 2021-11-20 07:45:05 \tgraph: graph_3\n", "claims:\n", " size: 10.17 GB \tmodified: 2021-11-20 07:29:41 \tgraph: graph_5\n", "classvizedge:\n", " size: 416.38 MB \tmodified: 2021-12-31 08:30:22 \tgraph: graph_8\n", "classvizenode:\n", " size: 383.09 MB \tmodified: 2021-12-31 09:03:51 \tgraph: graph_9\n", "classviznode:\n", " size: 383.09 MB \tmodified: 2021-12-31 09:03:51 \tgraph: graph_11\n", "datatypes:\n", " size: 48.81 KB \tmodified: 2021-11-20 03:43:23 \tgraph: graph_6\n", "description:\n", " size: 371.43 MB \tmodified: 2021-11-20 09:00:26 \tgraph: graph_4\n", "l_d_pgr_ud:\n", " size: 1.27 GB \tmodified: 2022-01-13 16:54:45 \tgraph: graph_10\n", "label:\n", " size: 614.97 MB \tmodified: 2021-11-20 09:27:57 \tgraph: graph_1\n", "pagerank_undirected:\n", " size: 740.37 MB \tmodified: 2021-11-20 14:56:23 \tgraph: graph_2\n", "qualifiers:\n", " size: 2.30 GB \tmodified: 2021-11-20 07:38:37 \tgraph: graph_7\n", "\n", "Graph Table Information:\n", "graph_1:\n", " size: 4.62 GB \tcreated: 2022-01-13 15:49:06\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_10:\n", " size: 12.36 GB \tcreated: 2022-01-13 16:56:45\n", " header: ['node1', 'label', 'node2', 'node2;upper', 'node1;pagerank', 'node1;description']\n", "graph_11:\n", " size: 17.50 GB \tcreated: 2022-01-13 18:25:50\n", " header: ['node1', 'graph', 'instance_count', 'node_type', 'label', 'tooltip']\n", "graph_2:\n", " size: 7.82 GB \tcreated: 2022-01-13 15:51:06\n", " header: ['node1', 'label', 'node2', 'id']\n", "graph_3:\n", " size: 896.11 MB \tcreated: 2022-01-13 15:51:27\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_4:\n", " size: 6.29 GB \tcreated: 2022-01-13 15:52:55\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_5:\n", " size: 95.24 GB \tcreated: 2022-01-13 16:13:20\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_6:\n", " size: 744.00 KB \tcreated: 2022-01-13 16:13:21\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_7:\n", " size: 25.23 GB \tcreated: 2022-01-13 16:19:19\n", " header: ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']\n", "graph_8:\n", " size: 8.52 GB \tcreated: 2022-01-13 16:22:30\n", " header: ['node1', 'label', 'node2', 'graph', 'edge_type']\n", "graph_9:\n", " size: 14.55 GB \tcreated: 2022-01-13 16:27:25\n", " header: ['node1', 'graph', 'instance_count', 'node_type', 'label', 'tooltip']\n" ] } ], "source": [ "!kgtk query --gc $STORE --show-cache " ] }, { "cell_type": "markdown", "id": "b18dbed3-eed3-429c-bbb9-975d9380b18a", "metadata": {}, "source": [ "## Define a function to do a `textmatch` search" ] }, { "cell_type": "code", "execution_count": 22, "id": "df5fa9ad-207a-46b8-9de6-b697f4e9e6e0", "metadata": {}, "outputs": [], "source": [ "def text_search_labels(search_text, limit=20):\n", " text_search_labels_query = _kapi2.get_query(\n", " doc=\"Doc string here\",\n", " name=f\"text_search_labels_{search_text}\",\n", " inputs='l_d_pgr_ud',\n", " match='l_d_pgr_ud: (qnode)-[l:label]->(y)',\n", " where=f'textmatch(y, \"{search_text}\")',\n", " ret='distinct qnode as node1, y as label, 10*matchscore(y) as score, cast(l.`node1;pagerank`, float) as prank, l.`node1;description` as description',\n", " order='score*prank',\n", " limit=limit\n", " )\n", " results = list([list(x) for x in text_search_labels_query.execute()])\n", " df = pd.DataFrame(results, columns=['node1', 'label', 'score', 'pagerank', 'description'])\n", " print(len(df))\n", " return df\n", " " ] }, { "cell_type": "code", "execution_count": 23, "id": "0c4b31bc-b8cf-4dd1-9071-b9bb54b89b95", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2022-01-13 18:29:32 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_10_c1.\"node1\" \"_aLias.node1\", graph_10_c1.\"node2\" \"_aLias.label\", (? * BM25(txtidx_1.graph_10_txtidx_ldpgridx)) \"_aLias.score\", CAST(graph_10_c1.\"node1;pagerank\" AS float) \"_aLias.prank\", graph_10_c1.\"node1;description\" \"_aLias.description\"\n", " FROM graph_10 AS graph_10_c1, graph_10_txtidx_ldpgridx AS txtidx_1\n", " WHERE graph_10_c1.\"label\" = ?\n", " AND txtidx_1.\"node2\" MATCH ? and txtidx_1.rowid = graph_10_c1.rowid\n", " ORDER BY (\"_aLias.score\" * \"_aLias.prank\") ASC\n", " LIMIT ?\n", " PARAS: [10, 'label', 'arn sch', 20]\n", "---------------------------------------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelscorepagerankdescription
0Q154770'Arnold Schoenberg'@en-107.4056192.983258e-07'Austrian-American composer (1874-1951)'@en
1Q2685'Arnold Schwarzenegger'@en-100.4261182.794892e-07'Austrian-American actor, businessman, bodybui...
2Q668095'Scharnstein'@en-119.9055701.907455e-07'municipality in Austria'@en
3Q506494'Schäftlarn'@en-122.2773591.862084e-07'municipality of Germany'@en
4Q723261'Scharn'@en-132.7834381.588464e-07'neighborhood in Maastricht, Netherlands'@en
5Q664826'Scharnitz'@en-124.7448721.239395e-07'municipality in Austria'@en
6Q522320'Arnschwang'@en-122.2773591.133833e-07'municipality of Germany'@en
7Q5659790'Harnischia'@en-122.2773591.014908e-07'genus of insects'@en
8Q2667816'Scharnegoutum'@en-115.4277131.051748e-07'human settlement in the Netherlands'@en
9Q77325'Arno Schmidt'@en-117.6240409.288887e-08'German poet'@en
10Q509737'Scharnhorst Order'@en-107.4056198.702578e-08'award'@en
11Q678395'Scharndorf'@en-122.2773597.542544e-08'municipality in Austria'@en
12Q18931623'Adelbert van Scharnlaan'@en-97.2658238.584699e-08'street in Maastricht, the Netherlands'@en
13Q27672057'Arno Schley'@en-119.9055706.687831e-08'badminton player'@en
14Q4861418'Barnard Castle School'@en-100.4261187.964473e-08'school in County Durham, UK'@en
15Q632660'Scharnebeck'@en-119.9055706.320210e-08'municipality of Germany'@en
16Q19508777'Scharnerweg'@en-119.9055705.883063e-08'street in Maastricht, the Netherlands'@en
17Q432443'Jan Arnoldus Schouten'@en-100.4261186.910608e-08'Dutch politician (1883-1971)'@en
18Q27530806'Arnd Rösch'@en-122.2773595.620266e-08'mathematician, chess player'@en
19Q78106'Arnold Schönhage'@en-109.3047585.492092e-08'German mathematician'@en
\n", "
" ], "text/plain": [ " node1 label score pagerank \\\n", "0 Q154770 'Arnold Schoenberg'@en -107.405619 2.983258e-07 \n", "1 Q2685 'Arnold Schwarzenegger'@en -100.426118 2.794892e-07 \n", "2 Q668095 'Scharnstein'@en -119.905570 1.907455e-07 \n", "3 Q506494 'Schäftlarn'@en -122.277359 1.862084e-07 \n", "4 Q723261 'Scharn'@en -132.783438 1.588464e-07 \n", "5 Q664826 'Scharnitz'@en -124.744872 1.239395e-07 \n", "6 Q522320 'Arnschwang'@en -122.277359 1.133833e-07 \n", "7 Q5659790 'Harnischia'@en -122.277359 1.014908e-07 \n", "8 Q2667816 'Scharnegoutum'@en -115.427713 1.051748e-07 \n", "9 Q77325 'Arno Schmidt'@en -117.624040 9.288887e-08 \n", "10 Q509737 'Scharnhorst Order'@en -107.405619 8.702578e-08 \n", "11 Q678395 'Scharndorf'@en -122.277359 7.542544e-08 \n", "12 Q18931623 'Adelbert van Scharnlaan'@en -97.265823 8.584699e-08 \n", "13 Q27672057 'Arno Schley'@en -119.905570 6.687831e-08 \n", "14 Q4861418 'Barnard Castle School'@en -100.426118 7.964473e-08 \n", "15 Q632660 'Scharnebeck'@en -119.905570 6.320210e-08 \n", "16 Q19508777 'Scharnerweg'@en -119.905570 5.883063e-08 \n", "17 Q432443 'Jan Arnoldus Schouten'@en -100.426118 6.910608e-08 \n", "18 Q27530806 'Arnd Rösch'@en -122.277359 5.620266e-08 \n", "19 Q78106 'Arnold Schönhage'@en -109.304758 5.492092e-08 \n", "\n", " description \n", "0 'Austrian-American composer (1874-1951)'@en \n", "1 'Austrian-American actor, businessman, bodybui... \n", "2 'municipality in Austria'@en \n", "3 'municipality of Germany'@en \n", "4 'neighborhood in Maastricht, Netherlands'@en \n", "5 'municipality in Austria'@en \n", "6 'municipality of Germany'@en \n", "7 'genus of insects'@en \n", "8 'human settlement in the Netherlands'@en \n", "9 'German poet'@en \n", "10 'award'@en \n", "11 'municipality in Austria'@en \n", "12 'street in Maastricht, the Netherlands'@en \n", "13 'badminton player'@en \n", "14 'school in County Durham, UK'@en \n", "15 'municipality of Germany'@en \n", "16 'street in Maastricht, the Netherlands'@en \n", "17 'Dutch politician (1883-1971)'@en \n", "18 'mathematician, chess player'@en \n", "19 'German mathematician'@en " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_search_labels('arn sch')" ] }, { "cell_type": "markdown", "id": "14958805-395c-4282-9701-80095827a512", "metadata": {}, "source": [ "## Define a function to search for Qnodes Exactly" ] }, { "cell_type": "code", "execution_count": 24, "id": "d6f0a3f1-f83a-4a30-a05a-51969d1762a3", "metadata": {}, "outputs": [], "source": [ "def exact_search_items(search_text, limit=20):\n", " search_text = search_text.upper()\n", " text_search_labels_query = _kapi2.get_query(\n", " doc=\"\"\"\n", " Create the Kypher query used by 'BrowserBackend.get_node_labels()'\n", " for case_independent searches.\n", " Given parameters 'NODE' and 'LANG' retrieve labels for 'NODE' in\n", " the specified language (using 'any' for 'LANG' retrieves all labels).\n", " Return distinct 'node1', 'node_label' pairs as the result (we include\n", " 'NODE' as an output to make it easier to union result frames).\n", " \"\"\",\n", " name=f'exact_search_items{search_text}',\n", " inputs='l_d_pgr_ud',\n", " match='l_d_pgr_ud: (n)-[r:label]->(l)',\n", " where=f'n=\"{search_text}\"',\n", " ret='distinct n as node1, l as node_label, r.`node1;description` as description',\n", ")\n", " results = list([list(x) for x in text_search_labels_query.execute()])\n", " df = pd.DataFrame(results, columns=['node1', 'label', 'description'])\n", " print(len(df))\n", " return df\n", " " ] }, { "cell_type": "code", "execution_count": 25, "id": "21de8b25-446f-43ed-83a4-36ac92e3410f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2022-01-13 18:30:09 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_10_c1.\"node1\" \"_aLias.node1\", graph_10_c1.\"node2\" \"_aLias.node_label\", graph_10_c1.\"node1;description\" \"_aLias.description\"\n", " FROM graph_10 AS graph_10_c1\n", " WHERE graph_10_c1.\"label\" = ?\n", " AND (graph_10_c1.\"node1\" = ?)\n", " LIMIT ?\n", " PARAS: ['label', 'Q30', 100]\n", "---------------------------------------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1\n", "CPU times: user 34 ms, sys: 1.83 ms, total: 35.8 ms\n", "Wall time: 59.7 ms\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labeldescription
0Q30'United States of America'@en'sovereign state in North America'@en
\n", "
" ], "text/plain": [ " node1 label description\n", "0 Q30 'United States of America'@en 'sovereign state in North America'@en" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "exact_search_items('q30')" ] }, { "cell_type": "code", "execution_count": 26, "id": "25e06b83-e5d4-4de5-a0ad-2defd4caf18a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2022-01-13 18:30:10 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_10_c1.\"node1\" \"_aLias.node1\", graph_10_c1.\"node2\" \"_aLias.node_label\", graph_10_c1.\"node1;description\" \"_aLias.description\"\n", " FROM graph_10 AS graph_10_c1\n", " WHERE graph_10_c1.\"label\" = ?\n", " AND (graph_10_c1.\"node1\" = ?)\n", " LIMIT ?\n", " PARAS: ['label', 'Q140', 100]\n", "---------------------------------------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1\n", "CPU times: user 33.3 ms, sys: 5.52 ms, total: 38.8 ms\n", "Wall time: 63.1 ms\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labeldescription
0Q140'lion'@en'species of big cat'@en
\n", "
" ], "text/plain": [ " node1 label description\n", "0 Q140 'lion'@en 'species of big cat'@en" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "exact_search_items('Q140')" ] }, { "cell_type": "markdown", "id": "b6478cb3-cd1f-4f6c-9056-5c78d889172b", "metadata": {}, "source": [ "## Define a function to search labels Exactly" ] }, { "cell_type": "code", "execution_count": 27, "id": "1c882fd8-2797-4dc1-8dc1-6ee3db485464", "metadata": {}, "outputs": [], "source": [ "def exact_search_labels(search_text, limit=20):\n", " search_text = f\"'{search_text.upper()}'@EN\"\n", " text_search_labels_query = _kapi2.get_query(\n", " doc=\"\"\"\n", " Exact Match case insensitive query\n", " \"\"\",\n", " name=f'exact_search_labels{search_text}',\n", " inputs='l_d_pgr_ud',\n", " match=f'l_d_pgr_ud: (n)-[r:label]->(l)',\n", " where=f'r.`node2;upper`=\"{search_text}\"',\n", " ret='distinct n as node1, l as node_label, cast(\"-1.0\", float) as score, cast(r.`node1;pagerank`, float) as prank, r.`node1;description` as description',\n", " order='score*prank',\n", " limit=limit\n", ")\n", " results = list([list(x) for x in text_search_labels_query.execute()])\n", " df = pd.DataFrame(results, columns=['node1', 'label', 'score', 'prank', 'description'])\n", " print(len(df))\n", " return df" ] }, { "cell_type": "code", "execution_count": 28, "id": "686b27db-8b2f-44df-aac8-f8d8224fa476", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2022-01-13 18:30:13 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_10_c1.\"node1\" \"_aLias.node1\", graph_10_c1.\"node2\" \"_aLias.node_label\", CAST(? AS float) \"_aLias.score\", CAST(graph_10_c1.\"node1;pagerank\" AS float) \"_aLias.prank\", graph_10_c1.\"node1;description\" \"_aLias.description\"\n", " FROM graph_10 AS graph_10_c1\n", " WHERE graph_10_c1.\"label\" = ?\n", " AND (graph_10_c1.\"node2;upper\" = ?)\n", " ORDER BY (\"_aLias.score\" * \"_aLias.prank\") ASC\n", " LIMIT ?\n", " PARAS: ['-1.0', 'label', \"'CANADA'@EN\", 20]\n", "---------------------------------------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelscoreprankdescription
0Q16'Canada'@en-1.01.052877e-03'sovereign state in North America'@en
1Q2569593'Canada'@en-1.08.219015e-08'French colony in New France'@en
2Q13265725'Canada'@en-1.02.816562e-08'family name'@en
3Q106501608'Canada'@en-1.01.475482e-08'Vanity Fair caricature of The Rt Hon Sir Wilf...
4Q13265795'Canada'@en-1.01.402956e-08'2012 novel by Richard Ford'@en
5Q2608363'Canada'@en-1.01.286213e-08'village in Appelscha, in the Netherlands'@en
6Q18612153'Canada'@en-1.01.278788e-08'180th strip of the webcomic xkcd'@en
7Q257304'Canada'@en-1.01.266340e-08'Wikimedia disambiguation page'@en
8Q103921530'Canada'@en-1.01.258331e-08'Shipwreck off the Scottish Coast, imported fr...
9Q14624136'Canada'@en-1.01.196157e-08'moth genus of Pteromalidae'@en
10Q5029265'Canada'@en-1.01.163423e-08'unincorporated community in Kansas, United St...
11Q99292858'Canada'@en-1.01.138884e-08'the country of Canada as depicted in Star Tre...
12Q107159786'Canada'@en-1.09.762842e-09'article in Problems of Empire'@en
13Q1031917'Canada'@en-1.09.468616e-09'unincorporated community in Pike County, Kent...
14Q18982533'Canada'@en-1.09.246475e-09'street in Elsloo, the Netherlands'@en
15Q5029289'Canada'@en-1.09.143933e-09'music group from Ann Arbor, Michigan, U.S.'@en
16Q93287969'Canada'@en-1.08.596045e-09'hamlet in Hampshire, UK'@en
17Q5029279'Canada'@en-1.08.290497e-09'full rigged ship built in 1891'@en
18Q28457929'Canada'@en-1.07.668673e-09'1786 ship'@en
19Q27722117'Canada'@en-1.07.489188e-09'scientific journal'@en
\n", "
" ], "text/plain": [ " node1 label score prank \\\n", "0 Q16 'Canada'@en -1.0 1.052877e-03 \n", "1 Q2569593 'Canada'@en -1.0 8.219015e-08 \n", "2 Q13265725 'Canada'@en -1.0 2.816562e-08 \n", "3 Q106501608 'Canada'@en -1.0 1.475482e-08 \n", "4 Q13265795 'Canada'@en -1.0 1.402956e-08 \n", "5 Q2608363 'Canada'@en -1.0 1.286213e-08 \n", "6 Q18612153 'Canada'@en -1.0 1.278788e-08 \n", "7 Q257304 'Canada'@en -1.0 1.266340e-08 \n", "8 Q103921530 'Canada'@en -1.0 1.258331e-08 \n", "9 Q14624136 'Canada'@en -1.0 1.196157e-08 \n", "10 Q5029265 'Canada'@en -1.0 1.163423e-08 \n", "11 Q99292858 'Canada'@en -1.0 1.138884e-08 \n", "12 Q107159786 'Canada'@en -1.0 9.762842e-09 \n", "13 Q1031917 'Canada'@en -1.0 9.468616e-09 \n", "14 Q18982533 'Canada'@en -1.0 9.246475e-09 \n", "15 Q5029289 'Canada'@en -1.0 9.143933e-09 \n", "16 Q93287969 'Canada'@en -1.0 8.596045e-09 \n", "17 Q5029279 'Canada'@en -1.0 8.290497e-09 \n", "18 Q28457929 'Canada'@en -1.0 7.668673e-09 \n", "19 Q27722117 'Canada'@en -1.0 7.489188e-09 \n", "\n", " description \n", "0 'sovereign state in North America'@en \n", "1 'French colony in New France'@en \n", "2 'family name'@en \n", "3 'Vanity Fair caricature of The Rt Hon Sir Wilf... \n", "4 '2012 novel by Richard Ford'@en \n", "5 'village in Appelscha, in the Netherlands'@en \n", "6 '180th strip of the webcomic xkcd'@en \n", "7 'Wikimedia disambiguation page'@en \n", "8 'Shipwreck off the Scottish Coast, imported fr... \n", "9 'moth genus of Pteromalidae'@en \n", "10 'unincorporated community in Kansas, United St... \n", "11 'the country of Canada as depicted in Star Tre... \n", "12 'article in Problems of Empire'@en \n", "13 'unincorporated community in Pike County, Kent... \n", "14 'street in Elsloo, the Netherlands'@en \n", "15 'music group from Ann Arbor, Michigan, U.S.'@en \n", "16 'hamlet in Hampshire, UK'@en \n", "17 'full rigged ship built in 1891'@en \n", "18 '1786 ship'@en \n", "19 'scientific journal'@en " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exact_search_labels('canada')" ] }, { "cell_type": "markdown", "id": "fe55f563-b3c1-4073-928a-2e1073e32635", "metadata": {}, "source": [ "## Define a function to fo a `textlike` search" ] }, { "cell_type": "code", "execution_count": 29, "id": "902c863f-cd5b-4897-a587-b041203d0398", "metadata": {}, "outputs": [], "source": [ "def text_like_search_labels(search_text, limit=20):\n", " search_label = f\"%{'%'.join(search_text.split(' '))}%\"\n", " print(search_text)\n", " text_search_labels_query = _kapi2.get_query(\n", " doc=\"Doc string here\",\n", " name=f\"text_like_search_labels_{search_text}\",\n", " inputs='l_d_pgr_ud',\n", " match='l_d_pgr_ud: (qnode)-[l:label]->(y)',\n", " where=f'textlike(y, \"{search_label}\")',\n", " ret='distinct qnode as node1, y as label, matchscore(y) as score, cast(l.`node1;pagerank`, float) as prank, l.`node1;description` as description',\n", " order='score*prank',\n", " limit=limit\n", " )\n", " results = list([list(x) for x in text_search_labels_query.execute()])\n", " df = pd.DataFrame(results, columns=['node1', 'label', 'score', 'pagerank', 'description'])\n", " print(len(df))\n", " return df\n", " " ] }, { "cell_type": "code", "execution_count": 30, "id": "c55f4cdf-c9ef-4b4a-a0a0-665393b50b09", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[2022-01-13 18:30:18 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT DISTINCT graph_10_c1.\"node1\" \"_aLias.node1\", graph_10_c1.\"node2\" \"_aLias.label\", BM25(txtidx_1.graph_10_txtidx_ldpgridx) \"_aLias.score\", CAST(graph_10_c1.\"node1;pagerank\" AS float) \"_aLias.prank\", graph_10_c1.\"node1;description\" \"_aLias.description\"\n", " FROM graph_10 AS graph_10_c1, graph_10_txtidx_ldpgridx AS txtidx_1\n", " WHERE graph_10_c1.\"label\" = ?\n", " AND txtidx_1.\"node2\" LIKE ? and txtidx_1.rowid = graph_10_c1.rowid\n", " ORDER BY (\"_aLias.score\" * \"_aLias.prank\") ASC\n", " LIMIT ?\n", " PARAS: ['label', '%fifa%group%b%', 20]\n", "---------------------------------------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fifa group b\n", "20\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1labelscorepagerankdescription
0Q31189406'2018 FIFA World Cup Group B'@en-14.9083343.970111e-08
1Q10260332'2014 FIFA World Cup Group B'@en-14.9083342.652110e-08
2Q17115977'Category:2014 FIFA World Cup group table temp...-11.1217732.265241e-08'Wikimedia category'@en
3Q18608402'Category:2015 FIFA Women\\'s World Cup group t...-10.1163381.991870e-08'Wikimedia category'@en
4Q20730977'2018 FIFA World Cup qualification – UEFA Grou...-11.3729561.526389e-08
5Q187411'2010 FIFA World Cup Group B'@en-14.9083341.102419e-08'football tournament'@en
6Q601980'2014 FIFA World Cup qualification – UEFA Grou...-11.3729561.365065e-08
7Q39134591'Template:2018 FIFA World Cup Group B table'@en-12.1995301.267818e-08'Wikimedia template'@en
8Q39134633'Template:2018 FIFA World Cup Group C table'@en-12.1995301.266194e-08'Wikimedia template'@en
9Q39395683'Template:2018 FIFA World Cup Group D table'@en-12.1995301.264030e-08'Wikimedia template'@en
10Q39395642'Template:2018 FIFA World Cup Group E table'@en-12.1995301.263154e-08'Wikimedia template'@en
11Q39134549'Template:2018 FIFA World Cup Group A table'@en-12.1995301.262999e-08'Wikimedia template'@en
12Q39395551'Template:2018 FIFA World Cup Group G table'@en-12.1995301.262856e-08'Wikimedia template'@en
13Q39395510'Template:2018 FIFA World Cup Group H table'@en-12.1995301.262313e-08'Wikimedia template'@en
14Q39395600'Template:2018 FIFA World Cup Group F table'@en-12.1995301.261995e-08'Wikimedia template'@en
15Q86680144'2022 FIFA World Cup qualification – UEFA Grou...-11.3729561.225984e-08
16Q12415572'Template:2006 FIFA World Cup Group B'@en-13.1556701.017848e-08'Wikimedia template'@en
17Q15838739'Template:1994 FIFA World Cup Group B'@en-13.1556701.017848e-08'Wikimedia template'@en
18Q17348977'Template:2002 FIFA World Cup Group B'@en-13.1556701.017848e-08'Wikimedia template'@en
19Q25735803'Template:1998 FIFA World Cup Group B'@en-13.1556701.017848e-08'Wikimedia template'@en
\n", "
" ], "text/plain": [ " node1 label score \\\n", "0 Q31189406 '2018 FIFA World Cup Group B'@en -14.908334 \n", "1 Q10260332 '2014 FIFA World Cup Group B'@en -14.908334 \n", "2 Q17115977 'Category:2014 FIFA World Cup group table temp... -11.121773 \n", "3 Q18608402 'Category:2015 FIFA Women\\'s World Cup group t... -10.116338 \n", "4 Q20730977 '2018 FIFA World Cup qualification – UEFA Grou... -11.372956 \n", "5 Q187411 '2010 FIFA World Cup Group B'@en -14.908334 \n", "6 Q601980 '2014 FIFA World Cup qualification – UEFA Grou... -11.372956 \n", "7 Q39134591 'Template:2018 FIFA World Cup Group B table'@en -12.199530 \n", "8 Q39134633 'Template:2018 FIFA World Cup Group C table'@en -12.199530 \n", "9 Q39395683 'Template:2018 FIFA World Cup Group D table'@en -12.199530 \n", "10 Q39395642 'Template:2018 FIFA World Cup Group E table'@en -12.199530 \n", "11 Q39134549 'Template:2018 FIFA World Cup Group A table'@en -12.199530 \n", "12 Q39395551 'Template:2018 FIFA World Cup Group G table'@en -12.199530 \n", "13 Q39395510 'Template:2018 FIFA World Cup Group H table'@en -12.199530 \n", "14 Q39395600 'Template:2018 FIFA World Cup Group F table'@en -12.199530 \n", "15 Q86680144 '2022 FIFA World Cup qualification – UEFA Grou... -11.372956 \n", "16 Q12415572 'Template:2006 FIFA World Cup Group B'@en -13.155670 \n", "17 Q15838739 'Template:1994 FIFA World Cup Group B'@en -13.155670 \n", "18 Q17348977 'Template:2002 FIFA World Cup Group B'@en -13.155670 \n", "19 Q25735803 'Template:1998 FIFA World Cup Group B'@en -13.155670 \n", "\n", " pagerank description \n", "0 3.970111e-08 \n", "1 2.652110e-08 \n", "2 2.265241e-08 'Wikimedia category'@en \n", "3 1.991870e-08 'Wikimedia category'@en \n", "4 1.526389e-08 \n", "5 1.102419e-08 'football tournament'@en \n", "6 1.365065e-08 \n", "7 1.267818e-08 'Wikimedia template'@en \n", "8 1.266194e-08 'Wikimedia template'@en \n", "9 1.264030e-08 'Wikimedia template'@en \n", "10 1.263154e-08 'Wikimedia template'@en \n", "11 1.262999e-08 'Wikimedia template'@en \n", "12 1.262856e-08 'Wikimedia template'@en \n", "13 1.262313e-08 'Wikimedia template'@en \n", "14 1.261995e-08 'Wikimedia template'@en \n", "15 1.225984e-08 \n", "16 1.017848e-08 'Wikimedia template'@en \n", "17 1.017848e-08 'Wikimedia template'@en \n", "18 1.017848e-08 'Wikimedia template'@en \n", "19 1.017848e-08 'Wikimedia template'@en " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_like_search_labels(\"fifa group b\")" ] }, { "cell_type": "markdown", "id": "69230f55-63f5-40d8-b587-e30c99e3a029", "metadata": {}, "source": [ "# Create the JSON lines file required for KGTK search Elasticsearch Index" ] }, { "cell_type": "markdown", "id": "6febbadd-6a3f-4ff1-9d88-3515a208c5cb", "metadata": {}, "source": [ "## Concatenate all the input files" ] }, { "cell_type": "code", "execution_count": null, "id": "4bcdea12-082b-4878-a5c8-8a6b7808d26a", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " kgtk(\"\"\" cat \n", " -i \"$GRAPH/labels.en.tsv.gz\"\n", " -i \"$GRAPH/aliases.en.tsv.gz\"\n", " -i \"$GRAPH/descriptions.en.tsv.gz\"\n", " -i \"$GRAPH/claims.tsv.gz\"\n", " -i \"$GRAPH/qualifiers.tsv.gz\"\n", " -i \"$GRAPH/metadata.pagerank.undirected.tsv.gz\"\n", " -i \"$GRAPH/derived.isastar.tsv.gz\"\n", " -o \"$OUT\"/kgtk.search.unsorted.tsv.gz\"\"\")" ] }, { "cell_type": "markdown", "id": "62912bbd-4b4b-4cc8-a36e-60fff53722f3", "metadata": {}, "source": [ "## Sort" ] }, { "cell_type": "code", "execution_count": null, "id": "6d52fa08-1b89-4846-817a-5ac73b2c74e5", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " kgtk(f\"\"\"sort --column node1 \n", " -X \"--parallel 8 --buffer-size 60% --temporary-directory {os.environ['TEMP']}\" \n", " -o \"$OUT/kgtk.search.sorted.tsv.gz\" \"\"\")" ] }, { "cell_type": "markdown", "id": "c9ad12a1-9af3-49bd-9192-c9a0ffb68140", "metadata": {}, "source": [ "## Run `kgtk build-kgtk-search-input` command to create the json lines file" ] }, { "cell_type": "code", "execution_count": null, "id": "f10c50d5-72c4-4923-94e5-ef1fcaa19fe7", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " kgtk(\"\"\"build-kgtk-search-input --input-file \"$OUT/kgtk.search.sorted.tsv.gz\"\n", " --output-file \"$OUT\"/kgtk.search.sorted.jl \n", " --label-properties label \n", " --alias-properties alias \n", " --extra-alias-properties P1448,P1705,P1477,P1810,P742,P1449 \n", " --description-properties description \n", " --pagerank-properties Pundirected_pagerank \n", " --mapping-file \"$OUT\"/wikidata_es_kgtk_search_mapping.json \n", " --property-datatype-file \"$GRAPH\"/metadata.property.datatypes.tsv.gz\"\"\")" ] }, { "cell_type": "markdown", "id": "efd306a3-764b-4dfe-a5cc-8555fe3e7529", "metadata": {}, "source": [ "## Create the Elasticsearch Index" ] }, { "cell_type": "code", "execution_count": null, "id": "4bcb7008-4eeb-4908-9ace-b8c056e02f9e", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " os.environ['ESURL'] = f\"{es_host}:{es_port}\"\n", " os.environ['ESINDEX'] = es_index" ] }, { "cell_type": "code", "execution_count": null, "id": "2175ecb8-dac6-476e-a110-b8c568fee417", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " !curl -H \"Content-Type: application/json\" -XPUT $ESURL/$ESINDEX -d @\"$OUT/wikidata_es_kgtk_search_mapping.json\"" ] }, { "cell_type": "markdown", "id": "6208de41-7657-421b-ac61-bdbd22104fb1", "metadata": {}, "source": [ "## Load the json lines file to the ES Index" ] }, { "cell_type": "markdown", "id": "5aceeeba-9a3b-4eb4-a382-27e38dcf63bd", "metadata": {}, "source": [ "The following command requires [table-linker](https://github.com/usc-isi-i2/table-linker) to be installed in the current virtual environment." ] }, { "cell_type": "code", "execution_count": null, "id": "733bb061-f13c-4b21-bea8-fff70de620d1", "metadata": {}, "outputs": [], "source": [ "if create_es_index:\n", " !tl load-elasticsearch-index --es-url $ESURL --es-index $ESINDEX --es-version 7 --kgtk-jl-path \"$OUT\"/kgtk.search.sorted.jl" ] }, { "cell_type": "code", "execution_count": 11, "id": "37a42c75-0d42-45fb-aa24-d671f6b36e70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-11 15:46:03 sqlstore]: IMPORT graph directly into table graph_9 from /Volumes/saggu-ssd/class-viz/class-visualization.edge.tsv.gz ...\n", "[2022-01-11 16:05:10 sqlstore]: IMPORT graph directly into table graph_10 from /Volumes/saggu-ssd/class-viz/class-visualization.node.tsv.gz ...\n", "[2022-01-11 17:49:19 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_9 AS graph_9_c1\n", " LIMIT ?\n", " PARAS: [3]\n", "---------------------------------------------\n", "node1\tlabel\tnode2\tgraph\tedge_type\n", "Q10267817\tP279\tQ18553442\tQ1225194\tsubclass\n", "Q107715\tP279\tQ309314\tQ246672\tsubclass\n", "Q107715\tP279\tQ309314\tQ937228\tsubclass\n" ] } ], "source": [ "!kgtk --debug query --graph-cache /Volumes/saggu-ssd/wikidata-dwd-v2/kgtk-search-6/temp.kgtk-search-6/wikidata.sqlite3.db \\\n", "-i \"/Volumes/saggu-ssd/class-viz/class-visualization.edge.tsv.gz\" --as classvizedge \\\n", "-i \"/Volumes/saggu-ssd/class-viz/class-visualization.node.tsv.gz\" --as classviznode --limit 3" ] }, { "cell_type": "code", "execution_count": 19, "id": "e063212c-0746-478e-b326-5187c31b31e1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 18:20:57 sqlstore]: IMPORT graph directly into table graph_11 from /data/amandeep/wikidata-20211027-dwd-v3/class-visualization.node.tsv.gz ...\n", "[2022-01-13 18:25:51 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_11 AS graph_11_c1\n", " LIMIT ?\n", " PARAS: [3]\n", "---------------------------------------------\n", "node1\tgraph\tinstance_count\tnode_type\tlabel\ttooltip\n", "Q10267817\tQ1225194\t1\tmany_subclasses\t'autosomal recessive disease'@en\tautosomal recessive disease (Q10267817)
instance count: 1
node type: many_subclasses\n", "Q107715\tQ246672\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n", "Q107715\tQ937228\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n" ] } ], "source": [ "!kgtk --debug query --graph-cache /data/amandeep/wikidata-20211027-dwd-v3/kgtk-browser-files/temp.kgtk-browser-files/wikidata.sqlite3.db \\\n", "-i \"/data/amandeep/wikidata-20211027-dwd-v3/class-visualization.node.tsv.gz\" --as classviznode --limit 3" ] }, { "cell_type": "code", "execution_count": null, "id": "0b50b2ae-968a-47bd-92cf-df399f996360", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i classvizedge --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": null, "id": "f641023d-58c0-41b1-bf19-936e35436568", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 18:26:08 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_11 AS graph_11_c1\n", " LIMIT ?\n", " PARAS: [5]\n", "---------------------------------------------\n", "[2022-01-13 18:26:08 sqlstore]: CREATE INDEX \"graph_11_label_idx\" ON \"graph_11\" (\"label\")\n", "[2022-01-13 18:27:34 sqlstore]: ANALYZE \"graph_11_label_idx\"\n", "node1\tgraph\tinstance_count\tnode_type\tlabel\ttooltip\n", "Q10267817\tQ1225194\t1\tmany_subclasses\t'autosomal recessive disease'@en\tautosomal recessive disease (Q10267817)
instance count: 1
node type: many_subclasses\n", "Q107715\tQ246672\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n", "Q107715\tQ937228\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n", "Q107715\tQ7184903\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n", "Q107715\tQ35120\t93\tmany_subclasses\t'physical quantity'@en\tphysical quantity (Q107715)
instance count: 93
node type: many_subclasses\n" ] } ], "source": [ "!kgtk --debug query -i classviznode --idx label --gc $STORE --limit 5" ] }, { "cell_type": "code", "execution_count": 7, "id": "9ec8033c-9872-4809-863c-24a7931cbe4d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Graph Cache:\n", "DB file: /Volumes/saggu-ssd/wikidata-dwd-v2/kgtk-search-6/temp.kgtk-search-6/wikidata.sqlite3.db\n", " size: 173.57 GB \tfree: 0 Bytes \tmodified: 2022-01-11 17:49:19\n", "\n", "KGTK File Information:\n", "alias:\n", " size: 130.57 MB \tmodified: 2021-11-17 14:57:10 \tgraph: graph_3\n", "claims:\n", " size: 9.59 GB \tmodified: 2021-11-19 12:20:27 \tgraph: graph_5\n", "classvizedge:\n", " size: 416.38 MB \tmodified: 2022-01-11 10:18:54 \tgraph: graph_9\n", "classviznode:\n", " size: 383.09 MB \tmodified: 2022-01-11 10:18:59 \tgraph: graph_10\n", "datatypes:\n", " size: 45.00 KB \tmodified: 2020-12-11 10:51:06 \tgraph: graph_6\n", "description:\n", " size: 341.84 MB \tmodified: 2021-11-19 12:11:44 \tgraph: graph_4\n", "l_d_pgr_ud:\n", " size: 1.18 GB \tmodified: 2021-12-02 20:48:25 \tgraph: graph_8\n", "label:\n", " size: 590.02 MB \tmodified: 2021-11-17 14:57:46 \tgraph: graph_1\n", "pagerank_undirected:\n", " size: 1.53 GB \tmodified: 2021-11-17 14:58:47 \tgraph: graph_2\n", "qualifiers:\n", " size: 2.09 GB \tmodified: 2021-11-19 12:14:21 \tgraph: graph_7\n", "\n", "Graph Table Information:\n", "graph_1:\n", " size: 4.15 GB \tcreated: 2021-12-02 18:36:14\n", " header: ['id', 'node1', 'label', 'node2']\n", "graph_10:\n", " size: 14.55 GB \tcreated: 2022-01-11 17:49:19\n", " header: ['node1', 'graph', 'instance_count', 'node_type', 'label', 'tooltip']\n", "graph_2:\n", " size: 19.31 GB \tcreated: 2021-12-02 18:53:02\n", " header: ['node1', 'label', 'node2', 'id']\n", "graph_3:\n", " size: 711.41 MB \tcreated: 2021-12-02 18:53:47\n", " header: ['id', 'node1', 'label', 'node2']\n", "graph_4:\n", " size: 5.27 GB \tcreated: 2021-12-02 18:57:40\n", " header: ['id', 'node1', 'label', 'node2']\n", "graph_5:\n", " size: 87.81 GB \tcreated: 2021-12-02 19:52:02\n", " header: ['id', 'node1', 'label', 'node2', 'rank', 'node2;wikidatatype']\n", "graph_6:\n", " size: 632.00 KB \tcreated: 2021-12-02 19:52:02\n", " header: ['id', 'node1', 'label', 'node2']\n", "graph_7:\n", " size: 23.57 GB \tcreated: 2021-12-02 20:10:50\n", " header: ['id', 'node1', 'label', 'node2', 'node2;wikidatatype']\n", "graph_8:\n", " size: 11.50 GB \tcreated: 2021-12-02 20:55:01\n", " header: ['node1', 'label', 'node2', 'node2;upper', 'node1;pagerank', 'node1;description']\n", "graph_9:\n", " size: 6.71 GB \tcreated: 2022-01-11 16:05:10\n", " header: ['node1', 'label', 'node2', 'graph', 'edge_type']\n" ] } ], "source": [ "!kgtk query --gc $STORE --show-cache" ] }, { "cell_type": "code", "execution_count": 31, "id": "6e0fa3d5-1bf3-4d04-8e84-7dc2fb05d87c", "metadata": {}, "outputs": [], "source": [ "roots = ['Q1420']\n", "for root in roots:\n", " kgtk(f\"\"\"\n", " query -i classvizedge\n", " --match '(class)-[{{label: property, graph: \"{root}\", edge_type: edge_type}}]->(superclass)'\n", " -o $TEMP/{root}.graph.low.tsv\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": 32, "id": "848932ad-2b2a-4b18-9eab-531c8548d38a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2022-01-13 18:31:29 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_11 AS graph_11_c1\n", " WHERE graph_11_c1.\"graph\" = ?\n", " AND graph_11_c1.\"instance_count\" = graph_11_c1.\"instance_count\"\n", " AND graph_11_c1.\"label\" = graph_11_c1.\"label\"\n", " PARAS: ['Q1420']\n", "---------------------------------------------\n", "\n" ] } ], "source": [ "for root in roots:\n", "\n", " kgtk(f\"\"\"--debug\n", " query -i classviznode\n", " --match '(class)-[{{graph: \"{root}\", instance_count: instance_count, label: label}}]->()'\n", " -o $TEMP/{root}.node.graph.low.tsv\n", " \"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d9fd381c-c64f-4e76-a3d7-33440cfd3bdd", "metadata": {}, "outputs": [], "source": [ "roots = ['Q1420']\n", "for root in roots:\n", " kgtk(f\"\"\"\n", " query -i classvizedge\n", " --match '(class)-[{{label: property, graph: \"{root}\", edge_type: edge_type}}]->(superclass)'\n", " -o $TEMP/{root}.graph.low.tsv\n", " \"\"\")" ] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }