{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# KGTK Tutorial\n", "\n", "Beer sites:\n", "- https://www.realbeer.com/edu/health/calories.php\n", "- http://getdrunknotfat.com/alcohol-content-of-beer/" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import sys \n", "sys.path.insert(0, 'tutorial')\n", "from tutorial_setup import *" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ALIAS',\n", " 'ALL',\n", " 'CLAIMS',\n", " 'DESCRIPTION',\n", " 'EXAMPLES_DIR',\n", " 'GE',\n", " 'ISA',\n", " 'ITEM',\n", " 'LABEL',\n", " 'OUT',\n", " 'P279',\n", " 'P279STAR',\n", " 'PROPERTY_DATATYPES',\n", " 'Q154ALIAS',\n", " 'Q154ALL',\n", " 'Q154CLAIMS',\n", " 'Q154DESCRIPTION',\n", " 'Q154ISA',\n", " 'Q154ITEM',\n", " 'Q154LABEL',\n", " 'Q154P279',\n", " 'Q154P279STAR',\n", " 'Q154PROPERTY_DATATYPES',\n", " 'Q154QUALIFIERS',\n", " 'Q154QUALIFIERS_TIME',\n", " 'Q154SITELINKS',\n", " 'QUALIFIERS',\n", " 'QUALIFIERS_TIME',\n", " 'SITELINKS',\n", " 'STORE',\n", " 'TE',\n", " 'TEMP',\n", " 'WIKIDATA',\n", " 'kgtk',\n", " 'kypher']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kgtk_environment_variables" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher\n" ] } ], "source": [ "%cd {output_path}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: wikidata_os_v5: File exists\n", "mkdir: temp.wikidata_os_v5: File exists\n" ] } ], "source": [ "!mkdir {output_folder}\n", "!mkdir {temp_folder}" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding: File 
exists\n" ] } ], "source": [ "!mkdir \"$GE\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding: File exists\n" ] } ], "source": [ "!mkdir \"$TE\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wikidata in KGTK\n", "KGTK can import a Wikidata JSON dump and convert it to the KGTK representation, making it easy to process the full Wikidata KG on a laptop. The import produces 86 files: files that hold all the information available in the Wikidata dump, plus files containing commonly used information derived from the dump. We partitioned the files because most use cases need only a subset of them.\n", "\n", "The files are very large. `claims.tsv` (23GB compressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 pedroszekely staff 68M Nov 16 08:07 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/aliases.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 4.7G Nov 16 08:05 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 269M Nov 16 08:08 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/descriptions.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 376M Nov 16 08:06 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/labels.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 662M Nov 16 08:43 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/qualifiers.tsv.gz\n" ] } ], "source": [ "!ls -lh \"$CLAIMS\" \"$QUALIFIERS\" \"$LABEL\" \"$ALIAS\" \"$DESCRIPTION\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`claims.tsv` contains many edges:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 254135077 1578463882 20285305033\n", "\n", "real\t1m15.857s\n", "user\t2m7.309s\n", "sys\t0m8.130s\n" ] } ], "source": [ "!time zcat < \"$CLAIMS\" | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# KGTK Data Model\n", "The KGTK data model is a generalization of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. 
The order of the columns is arbitrary.\n", "\n", "All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store additional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.\n", "Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` column records the Wikidata type of the value in the `node2` column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Claims" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2 rank node2;wikidatatype\n", "P10-P1628-32b85d-7927ece6-0 P10 P1628 \"http://www.w3.org/2006/vcard/ns#Video\" normal url\n", "P10-P1628-acf60d-b8950832-0 P10 P1628 \"https://schema.org/video\" normal url\n", "P10-P1629-Q34508-bcc39400-0 P10 P1629 Q34508 normal wikibase-item\n", "P10-P1659-P1651-c4068028-0 P10 P1659 P1651 normal wikibase-property\n", "P10-P1659-P18-5e4b9c4f-0 P10 P1659 P18 normal wikibase-property\n", "P10-P1659-P4238-d21d1ac0-0 P10 P1659 P4238 normal wikibase-property\n", "P10-P1659-P51-86aca4c5-0 P10 P1659 P51 normal wikibase-property\n", "P10-P1855-Q15075950-7eff6d65-0 P10 P1855 Q15075950 normal wikibase-item\n", "P10-P1855-Q69063653-c8cdb04c-0 P10 P1855 Q69063653 normal wikibase-item\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$CLAIMS\" | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wikidata uses numbers to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. 
The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements (the format of the id can be arbitrary, but we assigned ids so that sorting files by id places all edges about a subject on consecutive lines)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6\n", "module.exports = async params => {\n", " ^^^^^^\n", "\n", "SyntaxError: Unexpected identifier\n", " at createScript (vm.js:56:10)\n", " at Object.runInThisContext (vm.js:97:10)\n", " at Module._compile (module.js:549:28)\n", " at Object.Module._extensions..js (module.js:586:10)\n", " at Module.load (module.js:494:32)\n", " at tryModuleLoad (module.js:453:12)\n", " at Function.Module._load (module.js:445:3)\n", " at Module.require (module.js:504:17)\n", " at require (internal/module.js:20:19)\n", " at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)\n" ] } ], "source": [ "!wd u P10 P1628 P1629" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use the KGTK `query` command to fetch edges about Belgium. `$kypher` is a shortcut for the `kgtk query` command that also passes in the location of the SQLite database we use to store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with `label` `P1081` (Human Development Index)." ] }, { "cell_type": "code", "execution_count": 262, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr style=\"text-align: right;\"><th></th><th>id</th><th>node1</th><th>label</th><th>node2</th><th>rank</th><th>node2;wikidatatype</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Q31-P1036-c4e1ad-df86eeb8-0</td><td>Q31</td><td>P1036</td><td>\"2--493\"</td><td>normal</td><td>external-id</td></tr>\n",
"<tr><th>1</th><td>Q31-P1081-02c2ed-033524b0-0</td><td>Q31</td><td>P1081</td><td>+0.866</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>2</th><td>Q31-P1081-02c2ed-7971505b-0</td><td>Q31</td><td>P1081</td><td>+0.866</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>3</th><td>Q31-P1081-068470-c1c63b8d-0</td><td>Q31</td><td>P1081</td><td>+0.889</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>4</th><td>Q31-P1081-068470-ddac01e0-0</td><td>Q31</td><td>P1081</td><td>+0.889</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>5</th><td>Q31-P1081-144738-c1851cdc-0</td><td>Q31</td><td>P1081</td><td>+0.905</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>6</th><td>Q31-P1081-175742-c07ac1c8-0</td><td>Q31</td><td>P1081</td><td>+0.888</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>7</th><td>Q31-P1081-19636d-c08dd8a8-0</td><td>Q31</td><td>P1081</td><td>+0.896</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>8</th><td>Q31-P1081-1efc03-433a7a4d-0</td><td>Q31</td><td>P1081</td><td>+0.913</td><td>normal</td><td>quantity</td></tr>\n",
"<tr><th>9</th><td>Q31-P1081-1f8602-ddac530d-0</td><td>Q31</td><td>P1081</td><td>+0.852</td><td>normal</td><td>quantity</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " id node1 label node2 rank \\\n", "0 Q31-P1036-c4e1ad-df86eeb8-0 Q31 P1036 \"2--493\" normal \n", "1 Q31-P1081-02c2ed-033524b0-0 Q31 P1081 +0.866 normal \n", "2 Q31-P1081-02c2ed-7971505b-0 Q31 P1081 +0.866 normal \n", "3 Q31-P1081-068470-c1c63b8d-0 Q31 P1081 +0.889 normal \n", "4 Q31-P1081-068470-ddac01e0-0 Q31 P1081 +0.889 normal \n", "5 Q31-P1081-144738-c1851cdc-0 Q31 P1081 +0.905 normal \n", "6 Q31-P1081-175742-c07ac1c8-0 Q31 P1081 +0.888 normal \n", "7 Q31-P1081-19636d-c08dd8a8-0 Q31 P1081 +0.896 normal \n", "8 Q31-P1081-1efc03-433a7a4d-0 Q31 P1081 +0.913 normal \n", "9 Q31-P1081-1f8602-ddac530d-0 Q31 P1081 +0.852 normal \n", "\n", " node2;wikidatatype \n", "0 external-id \n", "1 quantity \n", "2 quantity \n", "3 quantity \n", "4 quantity \n", "5 quantity \n", "6 quantity \n", "7 quantity \n", "8 quantity \n", "9 quantity " ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher_raw -i \"$CLAIMS\" \\\n", "--match '(:Q31)-[]-()' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: id node1 label node2\n", "P10-label-en P10 label 'video'@en\n", "P1000-label-en P1000 label 'record held'@en\n", "P1001-label-en P1001 label 'applies to jurisdiction'@en\n", "P1002-label-en P1002 label 'engine configuration'@en\n", "error writing to outputP1003-label-en P1003 label 'National Library of Romania ID'@en\n", ": P1004-label-en P1004 label 'MusicBrainz place ID'@en\n", "Broken pipe\n", "P1005-label-en P1005 label 'Portuguese National Library ID'@en\n", "P1006-label-en P1006 label 'Nationale Thesaurus voor Auteurs ID'@en\n", "P1007-label-en P1007 label 'Lattes Platform number'@en\n" ] } ], "source": [ "!zcat < \"$LABEL\" | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KGTK accepts multiple files as input and can do a join to retrieve the label for each property. When using multiple files, it is necessary to tag each clause with the file that provides the data for the clause. For example, the first clause is tagged with `claim` because the word `claim` is part of the file name. The variable `property` is used to connect the two clauses." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.90 real 0.77 user 0.11 sys\n", "id node1 label node2 label;label\n", "Q31-P1036-c4e1ad-df86eeb8-0 Q31 P1036 \"2--493\" 'Dewey Decimal Classification'@en\n", "Q31-P1081-02c2ed-033524b0-0 Q31 P1081 +0.866 'Human Development Index'@en\n", "Q31-P1081-02c2ed-7971505b-0 Q31 P1081 +0.866 'Human Development Index'@en\n", "Q31-P1081-068470-c1c63b8d-0 Q31 P1081 +0.889 'Human Development Index'@en\n", "Q31-P1081-068470-ddac01e0-0 Q31 P1081 +0.889 'Human Development Index'@en\n", "Q31-P1081-144738-c1851cdc-0 Q31 P1081 +0.905 'Human Development Index'@en\n", "Q31-P1081-175742-c07ac1c8-0 Q31 P1081 +0.888 'Human Development Index'@en\n", "Q31-P1081-19636d-c08dd8a8-0 Q31 P1081 +0.896 'Human Development Index'@en\n", "Q31-P1081-1efc03-433a7a4d-0 Q31 P1081 +0.913 'Human Development Index'@en\n", "Q31-P1081-1f8602-ddac530d-0 Q31 P1081 +0.852 'Human Development Index'@en\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$LABEL\" \\\n", "--match 'claim: (n1:Q31)-[l {label: property}]-(n2), label: (property)-[:label]->(property_label)' \\\n", "--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the heads of state of Belgium, recorded in property `P35`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.86 real 0.74 user 0.10 sys\n", "id node1 label node2 node2;label\n", "Q31-P35-Q1079522-c82ed584-0 Q31 P35 Q1079522 'Erasme Louis Surlet de Chokier'@en\n", "Q31-P35-Q12967-f2b9aaf3-0 Q31 P35 Q12967 'Leopold II of Belgium'@en\n", "Q31-P35-Q12971-2088471b-0 Q31 P35 Q12971 'Leopold I of Belgium'@en\n", "Q31-P35-Q12973-31c1b700-0 Q31 P35 Q12973 'Leopold III of Belgium'@en\n", 
"Q31-P35-Q12976-f3e8a567-0 Q31 P35 Q12976 'Baudouin I of Belgium'@en\n", "Q31-P35-Q155004-619ba603-0 Q31 P35 Q155004 'Philippe I of Belgium'@en\n", "Q31-P35-Q3911-137f01fe-0 Q31 P35 Q3911 'Albert II of Belgium'@en\n", "Q31-P35-Q445553-7599749f-0 Q31 P35 Q445553 'Prince Charles, Count of Flanders'@en\n", "Q31-P35-Q55008046-725dce40-0 Q31 P35 Q55008046 'Albert I of Belgium'@en\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$LABEL\" \\\n", "--match 'claims: (n1:Q31)-[l:P35]->(n2), labels: (n2)-[:label]->(n2_label)' \\\n", "--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Qualifiers\n", "Qualifiers provide additional information about the claims stated in the edges. For `P1081` the qualifiers tell us the year, and for `P35` (head of state) the qualifiers provide information about the period of time and the position held by the head of state. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first head of state (Erasme Louis). To do so, we use the identifier of the edge (`Q31-P35-Q1079522-c82ed584-0`) as `node1` in the `qualifiers.tsv` file. We get three edges, meaning that the edge `Q31/P35/Q1079522` has three qualifiers. 
Note that the qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.90 real 0.77 user 0.11 sys\n", "id node1 label node2 node2;wikidatatype\n", "Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0 Q31-P35-Q1079522-c82ed584-0 P39 Q477406 wikibase-item\n", "Q31-P35-Q1079522-c82ed584-0-P580-106076-0 Q31-P35-Q1079522-c82ed584-0 P580 ^1831-02-25T00:00:00Z/11 time\n", "Q31-P35-Q1079522-c82ed584-0-P582-774519-0 Q31-P35-Q1079522-c82ed584-0 P582 ^1831-07-20T00:00:00Z/11 time\n" ] } ], "source": [ "!$kypher -i \"$QUALIFIERS\" \\\n", "--match '(n1:`Q31-P35-Q1079522-c82ed584-0`)-[l]->(n2)' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make them readable: the following query combines the patterns of the previous two queries to retrieve the labels of the property and node2. The query omits the identifier of the qualifier edges to save space. The headers of the additional columns can be arbitrary, i.e., you can name them whatever you want; the names used here follow a KGTK convention that enables KGTK to parse the output automatically, which is useful if we want to use the output as input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label` because the column contains the label of the entity." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.90 real 0.77 user 0.11 sys\n", "node1 label node2 label;label\n", "Q31-P35-Q1079522-c82ed584-0 P39 Q477406 'position held'@en\n", "Q31-P35-Q1079522-c82ed584-0 P580 ^1831-02-25T00:00:00Z/11 'start time'@en\n", "Q31-P35-Q1079522-c82ed584-0 P582 ^1831-07-20T00:00:00Z/11 'end time'@en\n" ] } ], "source": [ "!$kypher -i \"$QUALIFIERS\" -i \"$LABEL\" \\\n", "--match 'qual: (n1:`Q31-P35-Q1079522-c82ed584-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \\\n", "--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put all the values of `P35` in a file, which we will conveniently name `Q31.P35.tsv`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.83 real 0.71 user 0.09 sys\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" \\\n", "--match '(n1:Q31)-[l:P35]->(n2)' \\\n", "--return 'l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q31.P35.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to combine the `P35` edges of Belgium with the qualifiers. To do this we will run a query that uses the edges that we stored in `Q31.P35.tsv`, and retrieve the qualifiers for each of those edges; the result of our query will be the qualifier edges of the head of state edges. To union the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the output to the `sort2` command to sort the edges. The first 12 edges are shown below. 
We see a claim edge followed by the qualifiers defined for it.\n", "\n", "This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q31-P35-Q1079522-c82ed584-0 Q31 P35 Q1079522\n", "Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0 Q31-P35-Q1079522-c82ed584-0 P39 Q477406\n", "Q31-P35-Q1079522-c82ed584-0-P580-106076-0 Q31-P35-Q1079522-c82ed584-0 P580 ^1831-02-25T00:00:00Z/11\n", "Q31-P35-Q1079522-c82ed584-0-P582-774519-0 Q31-P35-Q1079522-c82ed584-0 P582 ^1831-07-20T00:00:00Z/11\n", "Q31-P35-Q12967-f2b9aaf3-0 Q31 P35 Q12967\n", "Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0 Q31-P35-Q12967-f2b9aaf3-0 P39 Q13592862\n", "Q31-P35-Q12967-f2b9aaf3-0-P580-f29037-0 Q31-P35-Q12967-f2b9aaf3-0 P580 ^1865-12-17T00:00:00Z/11\n", "Q31-P35-Q12967-f2b9aaf3-0-P582-136f02-0 Q31-P35-Q12967-f2b9aaf3-0 P582 ^1909-12-17T00:00:00Z/11\n", "Q31-P35-Q12971-2088471b-0 Q31 P35 Q12971\n", "Q31-P35-Q12971-2088471b-0-P39-Q13592862-0 Q31-P35-Q12971-2088471b-0 P39 Q13592862\n", "Q31-P35-Q12971-2088471b-0-P580-a35d41-0 Q31-P35-Q12971-2088471b-0 P580 ^1831-06-04T00:00:00Z/11\n", " 1.83 real 2.86 user 0.47 sys\n" ] } ], "source": [ "!$kypher -i \"$QUALIFIERS\" -i \"$TEMP\"/Q31.P35.tsv \\\n", "--match 'P35: ()-[l]->(), qual: (l)-[lq]->(n2)' \\\n", "--return 'lq as id, l as node1, lq.label as label, n2 as node2' \\\n", "/ cat -i - -i \"$TEMP\"/Q31.P35.tsv \\\n", "/ sort2 \\\n", "| head -12 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "- KGTK represents graphs in TSV files with standard columns `id`, `node1`, `label` and `node2`\n", "- It is possible to include arbitrary additional columns in KGTK files\n", "- The identifier of an edge can be used as a node in another edge enabling the representation of edges about edges\n", 
"- KGTK provides a powerful query command based on Cypher as well as a host of other commands; type `kgtk --help` to see the list of commands." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use Case: A Knowledge Graph About Alcoholic Beverages\n", "We are going to build a small KG about alcoholic beverages by extracting from Wikidata the subgraph that relates to alcoholic beverages (https://www.wikidata.org/wiki/Q154)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: create a list of all descendants of `alcoholic beverage` (https://www.wikidata.org/wiki/Q154)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6\n", "module.exports = async params => {\n", " ^^^^^^\n", "\n", "SyntaxError: Unexpected identifier\n", " at createScript (vm.js:56:10)\n", " at Object.runInThisContext (vm.js:97:10)\n", " at Module._compile (module.js:549:28)\n", " at Object.Module._extensions..js (module.js:586:10)\n", " at Module.load (module.js:494:32)\n", " at tryModuleLoad (module.js:453:12)\n", " at Function.Module._load (module.js:445:3)\n", " at Module.require (module.js:504:17)\n", " at require (internal/module.js:20:19)\n", " at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)\n" ] } ], "source": [ "!wd u Q154" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wikidata uses two properties to organize entities in a hierarchy: the `instance of` property (`P31`) and the `subclass of` (`P279`) property. In many cases, the distinction between instance of and subclass of is subtle, and we find many situations in Wikidata where either one or the other is used to organize hierarchies. 
For this reason, we created a new property called `isa` that contains the union of `P31` and `P279`, and stored it in the file `derived.isa.tsv`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "P10\tisa\tQ18610173\n", "P1000\tisa\tQ18608871\n", "P1001\tisa\tQ15720608\n", "P1001\tisa\tQ22984026\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$ISA\" | head -5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get all the alcoholic beverages, we need to get all entities that are `isa` of alcoholic beverage (`Q154`) or that are `isa` of any descendant of `Q154` in the `subclass of` (`P279`) hierarchy. The chain of `P279` edges can be arbitrarily long. To support this use case, KGTK offers the `derived.P279star.tsv` file that contains an edge `n1/P279star/n2` if `n1` is a descendant of `n2` via a chain of `P279` edges, including chains of zero length (`n1/P279star/n1`)." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: node1 label node2 id\n", "Q1000032 P279star Q1000032 Q1000032-P279star-Q1000032-0000\n", "Q1000032 P279star Q1150070 Q1000032-P279star-Q1150070-0000\n", "Q1000032 P279star Q1190554 Q1000032-P279star-Q1190554-0000\n", "Q1000032 P279star Q133500 Q1000032-P279star-Q133500-0000\n", "error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$P279STAR\" | head -5 | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get all alcoholic beverages, we need to find all nodes `n1` that are connected to `Q154` with an `isa` edge followed by a chain of `P279` edges:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3.18 real 0.93 user 0.57 sys\n" ] } ], "source": [ "!$kypher -i \"$ISA\" -i \"$P279STAR\" -i \"$LABEL\" \\\n", "--match 'isa: (n1)-[]->(n2), star: (n2)-[]->(n3:Q154), label: (n1)-[]->(n1l)' \\\n", "--return 'n1 as node1, n1l as `node1;label`, n3 as node2, \"isastar\" as label' \\\n", "-o \"$TEMP\"/Q154.descendant.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a sample of alcoholic beverages in Wikidata:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 node1;label node2 label\n", "Q1350656 'Corn whiskey'@en Q154 isastar\n", "Q20713240 'Buckwheat whisky'@en Q154 isastar\n", "Q2535077 'Rye Whiskey'@en Q154 isastar\n", "Q536976 'Canadian whisky'@en Q154 isastar\n", "Q7991845 'Wheat whiskey'@en Q154 isastar\n", "Q10429117 'Beyaz'@en Q154 isastar\n", "Q1069954 'Prosecco'@en Q154 isastar\n", "Q1094850 'Clairette du Languedoc'@en Q154 isastar\n", "Q1135592 'Cortese di Gavi'@en Q154 isastar\n" ] } ], "source": [ "!head \"$TEMP\"/Q154.descendant.tsv | column -t -s $'\\t'" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "And the total number:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3251 16116 133341 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.descendant.tsv\n" ] } ], "source": [ "!wc \"$TEMP\"/Q154.descendant.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The computation of `Q154.descendant.tsv` can be implemented in SPARQL using the common `P31/P279*` graph pattern, but the query will time out if the result size is large. For example, the query will time out when requesting all descendants of chemical compounds, as there are over one million chemical compounds in Wikidata. KGTK handles such queries easily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: get the incoming and outgoing edges\n", "We want our graph to have the neighbors of all alcoholic beverages, so we need to get the incoming and outgoing edges.\n", "\n", "The following query gets the outgoing edges." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.34 real 1.03 user 0.36 sys\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv \\\n", "--match 'Q154: (n1)-[]->(), claims: (n1)-[l]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.node1.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that we are getting several properties for our items:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q1000737-P1435-Q17297633-53903946-0 Q1000737 P1435 Q17297633\n", "Q1000737-P1454-Q460178-8ad4931b-0 Q1000737 P1454 Q460178\n", "Q1000737-P159-Q16003-31e24011-0 Q1000737 P159 Q16003\n", "Q1000737-P17-Q183-24107fe2-0 Q1000737 P17 Q183\n", "Q1000737-P18-147fc9-667304f8-0 Q1000737 P18 \"Marthabräuhalle 2011-04-03.jpg\"\n", "Q1000737-P31-Q131734-f97bd6f6-0 Q1000737 P31 Q131734\n", "Q1000737-P31-Q15075508-a4c83928-0 Q1000737 P31 Q15075508\n", "Q1000737-P373-689157-3110aade-0 Q1000737 P373 \"Marthabräu\"\n", "Q1000737-P452-Q869095-f5d8e7a2-0 Q1000737 P452 Q869095\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.node1.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now get the incoming edges:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.23 real 0.98 user 0.36 sys\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv \\\n", "--match 'Q154: (n1)-[]->(), claims: (n3)-[l]->(n1)' \\\n", "--return 'distinct l as id, n3 as node1, l.label as label, n1 as node2' \\\n", "-o \"$TEMP\"/Q154.node2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a 
sample of the edges we are getting:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q1350656-P279-Q1007164-7e3ecba9-0 Q1350656 P279 Q1007164\n", "zcat: Q20713240-P279-Q1007164-b3112260-0 Q20713240 P279 Q1007164\n", "Q2535077-P279-Q1007164-b2d3684b-0 Q2535077 P279 Q1007164\n", "Q536976-P279-Q1007164-8bf7467b-0 Q536976 P279 Q1007164\n", "Q7991845-P279-Q1007164-18bc383a-0 Q7991845 P279 Q1007164\n", "Q10337004-P186-Q10210-c56dd7ce-0 Q10337004 P186 Q10210\n", "Q10429117-P31-Q10210-d342f061-0 Q10429117 P31 Q10210\n", "Q1051699-P279-Q10210-65d32c67-0 Q1051699 P279 Q10210\n", "error writing to outputQ1058259-P279-Q10210-e204554a-0 Q1058259 P279 Q10210\n", ": Broken pipe\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.node2.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate the incoming and outgoing edges to put them in a single file:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.23 real 1.10 user 0.10 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.node1.tsv.gz -i \"$TEMP\"/Q154.node2.tsv.gz -o \"$TEMP\"/Q154.claims.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have about 28,000 edges:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 28142 116045 1584824\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.claims.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary of where we are:\n", "- Computed the list of entities below alcoholic beverage\n", "- Found all incoming and outgoing edges to these entities; for the new entities we bring in we have no information yet, only the q-node\n", "\n", "Not having any information about the entities connected to the alcoholic 
beverages is limiting, so let's get their outgoing edges. We run the query with `Q154.claims.tsv`, which uses all the entities in our graph, including the alcoholic beverages for which we already got outgoing edges; no harm done, as we can eliminate duplicates later." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.75 real 3.92 user 0.57 sys\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.claims.tsv.gz \\\n", "--match 'Q154: ()-[]->(n1), claims: (n1)-[l]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.hop.out.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, let's take a peek:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "Q1000-P1036-9bef62-f77ac5cf-0 Q1000 P1036 \"2--6721\"\n", "Q1000-P1081-0d345f-3a33abf5-0 Q1000 P1081 +0.641\n", "Q1000-P1081-0d345f-6da37c02-0 Q1000 P1081 +0.641\n", "Q1000-P1081-1100e3-c7631769-0 Q1000 P1081 +0.624\n", "Q1000-P1081-1ada51-7c71c229-0 Q1000 P1081 +0.639\n", "Q1000-P1081-345681-88a99cab-0 Q1000 P1081 +0.702\n", "Q1000-P1081-347db1-da0e5e03-0 Q1000 P1081 +0.637\n", "Q1000-P1081-419245-b03a8b59-0 Q1000 P1081 +0.647\n", "Q1000-P1081-419245-f8cd58e8-0 Q1000 P1081 +0.647\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.hop.out.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consolidate our edge files into one larger file. 
We use `compact` to remove duplicates and `sort2` to keep edges for the same subject together:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.07 real 7.09 user 0.59 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.claims.tsv.gz -i \"$TEMP\"/Q154.hop.out.tsv.gz \\\n", "/ compact \\\n", "/ sort2 \\\n", "-o \"$TEMP\"/Q154.edges.1.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have over 165,000 edges:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 165133 678398 8868474\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.edges.1.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a peek:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "P1389-P1855-Q1109662-9e2ef218-0 P1389 P1855 Q1109662\n", "P1582-P1855-Q17329207-f4ef508d-0 P1582 P1855 Q17329207\n", "P2581-P1855-Q7639844-08b3a4c7-0 P2581 P1855 Q7639844\n", "P2665-P1855-Q1067702-402a80a9-0 P2665 P1855 Q1067702\n", "P2665-P1855-Q170210-30d44f0b-0 P2665 P1855 Q170210\n", "P5420-P1855-Q44-209cffb1-0 P5420 P1855 Q44\n", "P5420-P1855-Q722338-73d7be75-0 P5420 P1855 Q722338\n", "zcat: P6088-P1855-Q1543214-3d934541-0 P6088 P1855 Q1543214\n", "P6088-P1855-Q4626-4ed65964-0 P6088 P1855 Q4626\n", "error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.edges.1.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have all the alcoholic beverages, we want to get the upper ontology of all the classes used, so that every class in our KG has a path to the root of the ontology. 
For example, first go to `drink` (`Q40050`), then to `liquid` (`Q11435`), then `fluid` (`Q102205`) and so on until we reach `entity` (`Q35120`).\n", "\n", "To do this, we need to get all the `isa` of all items in our graph, then get `P279star` so we get the list of all classes that these items descend from. Finally we need to get all the `P279` edges between them." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 13.58 real 9.23 user 1.18 sys\n" ] } ], "source": [ "!$kypher -i \"$TEMP\"/Q154.edges.1.tsv.gz -i \"$P279STAR\" -i \"$ISA\" \\\n", "--match 'Q154: (n1)-[]->(), isa: (n1)-[]->(n2), P279: (n2)-[]->(class)' \\\n", "--return 'distinct class as node1' \\\n", "-o \"$TEMP\"/Q154.classes.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have almost 3,000 classes in the upper ontology for the entities in our graph:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2846 2846 24939 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.classes.tsv\n" ] } ], "source": [ "!wc \"$TEMP\"/Q154.classes.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now use the `derived.P279.tsv` file to get the `P279` edges that connect a class to its superclass." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.48 real 0.89 user 0.22 sys\n" ] } ], "source": [ "!$kypher -i \"$TEMP\"/Q154.classes.tsv -i \"$P279\" \\\n", "--match 'Q154: (class)-[]->(), P279: (class)-[l]->(super)' \\\n", "--return 'distinct l as id, class as node1, l.label as label, super as node2' \\\n", "-o \"$TEMP\"/Q154.P279.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get about 4,500 `P279` edges in the upper ontology; we will take care of potential duplicates in a final cleanup step:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 4517 18068 249492 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.P279.tsv\n" ] } ], "source": [ "!wc \"$TEMP\"/Q154.P279.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see several q-nodes below `entity` (`Q35120`), a good indication that we computed the upper ontology correctly:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q16686448-P279-Q35120-674edbf9-0 Q16686448 P279 Q35120\n", "Q35120-P279-25b964-0520e300-0 Q35120 P279 novalue\n", "Q58415929-P279-Q35120-75659d0c-0 Q58415929 P279 Q35120\n", "Q23958946-P279-Q35120-70a9ed90-0 Q23958946 P279 Q35120\n", "Q488383-P279-Q35120-5fad2ad7-0 Q488383 P279 Q35120\n" ] } ], "source": [ "!grep Q35120 \"$TEMP\"/Q154.P279.tsv | head -5 | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consolidate the edges again:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.14 real 7.12 user 0.57 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.edges.1.tsv.gz -i \"$TEMP\"/Q154.P279.tsv \\\n", "/ compact \\\n", "/ sort2 \\\n", "-o 
\"$TEMP\"/Q154.edges.2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have over 175,000 edges:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 169047 694054 9085731\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.edges.2.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary:\n", "- We have the instances of alcoholic beverages\n", "- We added incoming and outgoing edges\n", "- For the outgoing edges, we went one hop forward\n", "- We got the upper ontology\n", "\n", "The properties are also items in Wikidata, so let's collect them all and get their edges." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.13 real 2.03 user 0.31 sys\n" ] } ], "source": [ "!$kypher -i \"$TEMP\"/Q154.edges.2.tsv.gz \\\n", "--match '()-[l {label: property}]->()' \\\n", "--return 'distinct property as node1' \\\n", "-o \"$TEMP\"/Q154.properties.tsv" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\n", "P10\n", "P1001\n", "P1003\n", "P1004\n", "P1005\n", "P1006\n", "P101\n", "P1014\n", "P1015\n" ] } ], "source": [ "!head \"$TEMP\"/Q154.properties.tsv | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the edges of these properties:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.25 real 0.91 user 0.18 sys\n" ] } ], "source": [ "!$kypher -i \"$CLAIMS\" -i \"$TEMP\"/Q154.properties.tsv \\\n", "--match 'Q154: (p)-[]->(), claims: (p)-[l]->(n2)' \\\n", "--return 'distinct l as id, p as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.properties.edges.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a peek, 
it looks like what we had before because the file is sorted, so let's proceed:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "P10-P1628-32b85d-7927ece6-0 P10 P1628 \"http://www.w3.org/2006/vcard/ns#Video\"\n", "P10-P1628-acf60d-b8950832-0 P10 P1628 \"https://schema.org/video\"\n", "P10-P1629-Q34508-bcc39400-0 P10 P1629 Q34508\n", "P10-P1659-P1651-c4068028-0 P10 P1659 P1651\n", "P10-P1659-P18-5e4b9c4f-0 P10 P1659 P18\n", "P10-P1659-P4238-d21d1ac0-0 P10 P1659 P4238\n", "P10-P1659-P51-86aca4c5-0 P10 P1659 P51\n", "P10-P1855-Q15075950-7eff6d65-0 P10 P1855 Q15075950\n", "P10-P1855-Q69063653-c8cdb04c-0 P10 P1855 Q69063653\n" ] } ], "source": [ "!head \"$TEMP\"/Q154.properties.edges.tsv | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consolidate the edges again:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6.18 real 8.48 user 0.65 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.edges.2.tsv.gz -i \"$TEMP\"/Q154.properties.edges.tsv \\\n", "/ compact \\\n", "/ sort2 \\\n", "-o \"$TEMP\"/Q154.edges.3.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of edges grew to about 197,500:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 197521 811687 10791930\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.edges.3.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary:\n", "- We have the instances of alcoholic beverages\n", "- We added incoming and outgoing edges\n", "- For the outgoing edges, we went one hop forward\n", "- We got the upper ontology\n", "- And we have the edges on all the properties being used\n", "\n", "We will stop adding nodes to the KG at this time, and proceed to add the labels 
for all the nodes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: get the labels, aliases and descriptions of all the items in our KG\n", "Before we start, let's define an environment variable to hold the final edges file so that if we change our mind later, we can update it without having to change the commands below." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "os.environ[\"Q154GRAPH\"] = os.environ[\"TEMP\"] + \"/Q154.edges.3.tsv.gz\"" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.edges.3.tsv.gz\n" ] } ], "source": [ "!ls \"$Q154GRAPH\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the labels of the `node1` nodes" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.02 real 2.81 user 0.87 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$LABEL\" \\\n", "--match 'Q154: (n1)-[]-(), label: (n1)-[l]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.label.node1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2\n", "P10-label-en P10 label 'video'@en\n", "P1001-label-en P1001 label 'applies to jurisdiction'@en\n", "P1003-label-en P1003 label 'National Library of Romania ID'@en\n", "P1004-label-en P1004 label 'MusicBrainz place ID'@en\n", "P1005-label-en P1005 label 'Portuguese National Library ID'@en\n", "P1006-label-en P1006 label 'Nationale Thesaurus voor Auteurs 
ID'@en\n", "P101-label-en P101 label 'field of work'@en\n", "P1014-label-en P1014 label 'Getty AAT ID'@en\n", "P1015-label-en P1015 label 'NORAF ID'@en\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.label.node1.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the labels of the `node2` nodes" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 8.45 real 2.05 user 1.71 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$LABEL\" \\\n", "--match 'Q154: ()-[]-(n2), label: (n2)-[l]->(n3)' \\\n", "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n", "-o \"$TEMP\"/Q154.label.node2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate the two label files" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.66 real 1.52 user 0.10 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.label.node1.tsv.gz -i \"$TEMP\"/Q154.label.node2.tsv.gz \\\n", "-o \"$TEMP\"/labels.tsv.gz" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 56123 289814 3031029\n" ] } ], "source": [ "!zcat < \"$TEMP\"/labels.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the aliases of `node1` nodes" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2.55 real 1.51 user 0.37 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$ALIAS\" \\\n", "--match 'Q154: (n1)-[]-(), alias: (n1)-[l]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.alias.node1.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Get the aliases of `node2` nodes" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3.44 real 1.59 user 0.59 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$ALIAS\" \\\n", "--match 'Q154: ()-[]-(n2), alias: (n2)-[l]->(n3)' \\\n", "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n", "-o \"$TEMP\"/Q154.alias.node2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate the two alias files" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.63 real 1.49 user 0.11 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.alias.node1.tsv.gz -i \"$TEMP\"/Q154.alias.node2.tsv.gz \\\n", "-o \"$TEMP\"/alias.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the descriptions of `node1` nodes" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3.09 real 1.11 user 0.52 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$DESCRIPTION\" \\\n", "--match 'Q154: (n1)-[]-(), description: (n1)-[l]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.description.node1.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the descriptions of `node2` nodes" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 8.51 real 1.94 user 1.70 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$DESCRIPTION\" \\\n", "--match 'Q154: ()-[]-(n2), description: (n2)-[l]->(n3)' \\\n", "--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \\\n", "-o \"$TEMP\"/Q154.description.node2.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concatenate the two 
description files" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.67 real 1.48 user 0.11 sys\n" ] } ], "source": [ "!$kgtk cat -i \"$TEMP\"/Q154.description.node1.tsv.gz -i \"$TEMP\"/Q154.description.node2.tsv.gz \\\n", "-o \"$TEMP\"/description.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: get the qualifiers" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.29 real 2.44 user 0.73 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$QUALIFIERS\" \\\n", "--match 'Q154: ()-[l]->(), qual: (l)-[lq]->(n2)' \\\n", "--return 'lq as id, l as node1, lq.label as label, n2 as node2' \\\n", "-o \"$OUT\"/qualifiers.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: error writing to output: Broken pipe\n", "id node1 label node2\n", "P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0 P10-P1855-Q15075950-7eff6d65-0 P10 \"Smoorverliefd 12 september.webm\"\n", "P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0 P10-P1855-Q15075950-7eff6d65-0 P3831 Q622550\n", "P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0 P10-P1855-Q69063653-c8cdb04c-0 P10 \"Couch Commander.webm\"\n", "P10-P1855-Q7378-555592a4-0-P10-8a982d-0 P10-P1855-Q7378-555592a4-0 P10 \"Elephants Dream (2006).webm\"\n", "P10-P2302-Q21502404-d012aef4-0-P1793-f4c2ed-0 P10-P2302-Q21502404-d012aef4-0 P1793 \"(?i).+\\\\\\\\.(webm\\\\|ogv\\\\|ogg\\\\|gif)\"\n", "P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0 P10-P2302-Q21502404-d012aef4-0 P2316 Q21502408\n", "P10-P2302-Q21502404-d012aef4-0-P2916-cb0917-0 P10-P2302-Q21502404-d012aef4-0 P2916 'filename with extension: webm, ogg, ogv, or gif (case insensitive)'@en\n", "P10-P2302-Q21510851-5224fe0b-0-P2306-P175-0 
P10-P2302-Q21510851-5224fe0b-0 P2306 P175\n", "P10-P2302-Q21510851-5224fe0b-0-P2306-P180-0 P10-P2302-Q21510851-5224fe0b-0 P2306 P180\n" ] } ], "source": [ "!zcat < \"$OUT\"/qualifiers.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 109816 446163 10639203\n" ] } ], "source": [ "!zcat < \"$OUT\"/qualifiers.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: consolidate all the files" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-12-23 18:28:45-- https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 2617 (2.6K) [text/plain]\n", "Saving to: ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’\n", "\n", "/Users/pedroszekely 100%[===================>] 2.56K --.-KB/s in 0s \n", "\n", "2020-12-23 18:28:46 (14.4 MB/s) - ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’ saved [2617/2617]\n", "\n" ] } ], "source": [ "!wget https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv -O \"$TEMP\"/kgtk.properties.tsv" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2 id\n", "isa label \"is a\"@en isa-label-e79b73\n", "isa alias \"isa\"@en isa-alias-7773c5\n", "isa description \"Instance or subclass relationship\"@en isa-description-0b5cdc\n", "isa P31 Q18616576 isa-P31-Q18616576\n", "isa P31 Q28326461 isa-P31-Q28326461\n", "isa P31 Q18647519 isa-P31-Q18647519\n", "isa data_type wikibase-item isa-data_type-643cc9\n", "P279star label \"is a\"@en P279star-label-e79b73\n", "P279star alias \"isa\"@en P279star-alias-7773c5\n" ] } ], "source": [ "!head \"$TEMP\"/kgtk.properties.tsv | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a peek at the property datatypes file:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "P10-datatype\tP10\tdatatype\tcommonsMedia\n", "P1000-datatype\tP1000\tdatatype\twikibase-item\n", "P1001-datatype\tP1001\tdatatype\twikibase-item\n", "P1002-datatype\tP1002\tdatatype\twikibase-item\n", "P1003-datatype\tP1003\tdatatype\texternal-id\n", "P1004-datatype\tP1004\tdatatype\texternal-id\n", "P1005-datatype\tP1005\tdatatype\texternal-id\n", "P1006-datatype\tP1006\tdatatype\texternal-id\n", "P1007-datatype\tP1007\tdatatype\texternal-id\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$PROPERTY_DATATYPES\" 
| head" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.26 real 0.76 user 0.11 sys\n" ] } ], "source": [ "!$kypher -i \"$Q154GRAPH\" -i \"$PROPERTY_DATATYPES\" \\\n", "--match 'Q154: (n1)-[]->(), property: (n1)-[l:datatype]->(n2)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "P10-datatype\tP10\tdatatype\tcommonsMedia\n", "P1001-datatype\tP1001\tdatatype\twikibase-item\n", "P1003-datatype\tP1003\tdatatype\texternal-id\n", "P1004-datatype\tP1004\tdatatype\texternal-id\n", "P1005-datatype\tP1005\tdatatype\texternal-id\n", "P1006-datatype\tP1006\tdatatype\texternal-id\n", "P101-datatype\tP101\tdatatype\twikibase-item\n", "P1014-datatype\tP1014\tdatatype\texternal-id\n", "P1015-datatype\tP1015\tdatatype\texternal-id\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz | head" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 8.95 real 11.87 user 0.73 sys\n" ] } ], "source": [ "!$kgtk cat \\\n", "-i \"$TEMP\"/labels.tsv.gz \\\n", "-i \"$TEMP\"/alias.tsv.gz \\\n", "-i \"$TEMP\"/description.tsv.gz \\\n", "-i \"$TEMP\"/Q154.edges.3.tsv.gz \\\n", "-i \"$TEMP\"/kgtk.properties.tsv \\\n", "-i \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz \\\n", "/ compact \\\n", "/ sort2 \\\n", "-o \"$OUT\"/all.tsv.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "count(DISTINCT graph_35_c1.\"node1\")\n", "13147\n", " 0.92 real 0.79 user 0.10 sys\n" ] } ], "source": [ "!$kypher -i \"$TEMP\"/Q154.edges.3.tsv.gz \\\n", "--match '(n1)-[]->()' \\\n", "--return 'count(distinct n1)'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 346639 1718566 20581359\n" ] } ], "source": [ "!zcat < \"$OUT\"/all.tsv.gz | wc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6: partition the files to follow the conventions KGTK uses for Wikidata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the partition-wikidata notebook to complete this step. This notebook expects an input file that includes all edges and qualifiers together. We also need to specify a directory where partitioned files should be created, and a directory where temporary files can be sent (this should be different from our temp directory as the partition notebook will clear any existing files in this folder)." 
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts: File exists\n" ] } ], "source": [ "!mkdir $OUT/parts" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 6.40 real 6.18 user 0.16 sys\n" ] } ], "source": [ "!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "P10-P1628-32b85d-7927ece6-0\tP10\tP1628\t\"http://www.w3.org/2006/vcard/ns#Video\"\n", "P10-P1628-acf60d-b8950832-0\tP10\tP1628\t\"https://schema.org/video\"\n", "P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n", "P10-P1659-P1651-c4068028-0\tP10\tP1659\tP1651\n", "P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n", "P10-P1659-P4238-d21d1ac0-0\tP10\tP1659\tP4238\n", "P10-P1659-P51-86aca4c5-0\tP10\tP1659\tP51\n", "P10-P1855-Q15075950-7eff6d65-0\tP10\tP1855\tQ15075950\n", "P10-P1855-Q69063653-c8cdb04c-0\tP10\tP1855\tQ69063653\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $TEMP/all_and_qualifiers.tsv.gz | head" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "fb65c07ac2d747fe83c873ace33123bd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(HTML(value='Executing'), FloatProgress(value=0.0, max=49.0), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "{'cells': [{'cell_type': 'markdown',\n", " 
'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:18.778765',\n", " 'end_time': '2020-12-24T04:01:18.804088',\n", " 'duration': 0.025323,\n", " 'status': 'completed'}},\n", " 'source': '# Partitioning a subset of Wikidata\\n\\nThis notebook illustrates how to partition a Wikidata KGTK edges file.\\n\\nParameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\\n\\n```\\npapermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\\\\n-p wikidata_input_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz \\\\\\n-p wikidata_parts_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts \\\\\\n```\\n\\nHere is a sample of the records that might appear in the input KGTK file:\\n```\\nid\\tnode1\\tlabel\\tnode2\\trank\\tnode2;wikidatatype\\tlang\\nQ1-P1036-418bc4-78f5a565-0\\tQ1\\tP1036\\t\"113\"\\tnormal\\texternal-id\\t\\nQ1-P1343-Q19190511-ab132b87-0 Q1 P1343 Q19190511 normal wikibase-item \\nQ1-P18-92a7b3-0dcac501-0 Q1 P18 \"Hubble ultra deep field.jpg\" normal commonsMedia \\nQ1-P2386-cedfb0-0fdbd641-0 Q1 P2386 +880000000000000000000000Q828224 normal quantity \\nQ1-P580-a2fccf-63cf4743-0 Q1 P580 ^-13798000000-00-00T00:00:00Z/3 normal time \\nQ1-P920-47c0f2-52689c4e-0 Q1 P920 \"LEM201201756\" normal string \\nQ1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0 Q1-P1343-Q19190511-ab132b87-0 P805 Q84065667 wikibase-item \\nQ1-P1343-Q88672152-5080b9e2-0-P304-5724c3-0 Q1-P1343-Q88672152-5080b9e2-0 P304 \"13-36\" string \\nQ1-P2670-Q18343-030eb87e-0-P1107-ce87f8-0 Q1-P2670-Q18343-030eb87e-0 P1107 +0.70 quantity \\nQ1-P793-Q273508-1900d69c-0-P585-a2fccf-0 Q1-P793-Q273508-1900d69c-0 P585 ^-13798000000-00-00T00:00:00Z/3 time \\nP10-alias-en-282226-0 P10 alias \\'gif\\'@en\\nP10-description-en P10 description \\'relevant video. For images, use the property P18. 
For film trailers, qualify with \\\\\"object has role\\\\\" (P3831)=\\\\\"trailer\\\\\" (Q622550)\\'@en en\\nP10-label-en P10 label \\'video\\'@en en\\nQ1-addl_wikipedia_sitelink-19e42a-0 Q1 addl_wikipedia_sitelink http://enwikiquote.org/wiki/Universe en\\nQ1-addl_wikipedia_sitelink-19e42a-0-language-0 Q1-addl_wikipedia_sitelink-19e42a-0 sitelink-language en en\\nQ1-addl_wikipedia_sitelink-19e42a-0-site-0 Q1-addl_wikipedia_sitelink-19e42a-0 sitelink-site enwikiquote en\\nQ1-addl_wikipedia_sitelink-19e42a-0-title-0 Q1-addl_wikipedia_sitelink-19e42a-0 sitelink-title \"Universe\" en\\nQ1-wikipedia_sitelink-5e459a-0 Q1 wikipedia_sitelink http://en.wikipedia.org/wiki/Universe en\\nQ1-wikipedia_sitelink-5e459a-0-badge-Q17437798 Q1-wikipedia_sitelink-5e459a-0 sitelink-badge Q17437798 en\\nQ1-wikipedia_sitelink-5e459a-0-language-0 Q1-wikipedia_sitelink-5e459a-0 sitelink-language en en\\nQ1-wikipedia_sitelink-5e459a-0-site-0 Q1-wikipedia_sitelink-5e459a-0 sitelink-site enwiki en\\nQ1-wikipedia_sitelink-5e459a-0-title-0 Q1-wikipedia_sitelink-5e459a-0 sitelink-title \"Universe\" en\\n```\\nHere are some contraints on the contents of the input file:\\n- The input file starts with a KGTK header record.\\n - In addition to the `id`, `node1`, `label`, and `node2` columns, the file may contain the `node2;wikidatatype` column.\\n - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.\\n - If it does not exist, it will be created during the partitioning process and populated using `datatype` relationships.\\n - If it does exist, any empty values in the column will be populated using `datatype` relationships.\\n- The `id` column must contain a nonempty value.\\n- The first section of an `id` value must be the `node` value for the record.\\n - The qualifier extraction operations depend upon this constraint. 
\\n- In addition to the claims and qualifiers, the input file is expected to contain:\\n - English language labels for all property entities appearing in the file.\\n- The input file ought to contain the following:\\n - claims records,\\n - qualifier records,\\n - alias records in appropriate languages,\\n - description records in appropriate languages,\\n - label records in appropriate languages, and\\n - sitelink records in appropriate languages.\\n - `datatype` records that map Wikidata property entities to Wikidata property datatypes. These records are required if the input file does not contain the `node2;wikidatatype` column.\\n- Additionally, this script provides for the appearance of `type` records in the input file.\\n - `type` records that list all `entityId` values and identify them as properties or items. These records provides a correctness check on the operation of `kgtk import-wikidata`, and may be deprecated in the future.\\n- The input file is assumed to be unsorted. If it is already sorted on the (`id` `node1` `label` `node2`) columns , then set the `presorted` parameter to `True` to shorten the execution time of this script.'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:18.823948',\n", " 'end_time': '2020-12-24T04:01:18.922481',\n", " 'duration': 0.098533,\n", " 'status': 'completed'}},\n", " 'source': \"### Parameters for invoking the notebook\\n\\n| Parameter | Description | Default |\\n| --------- | ----------- | ------- |\\n| `wikidata_input_path` | A folder containing the Wikidata KGTK edges to partition. | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/data/all.tsv.gz' |\\n| `wikidata_parts_path` | A folder to receive the partitioned Wikidata files, such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |\\n| `temp_folder_path` | A folder that may be used for temporary files. 
| wikidata_parts_path + '/temp' |\\n| `gzip_command` | The compression command for sorting. | 'pigz' (Note: use version 2.4 or later)|\\n| `kgtk_command` | The kgtk commmand. | 'time kgtk' |\\n| `kgtk_options` | The kgtk commmand options. | '--debug --timing' |\\n| `kgtk_extension` | The file extension for generated KGTK files. Appending `.gz` implies gzip compression. | 'tsv.gz' |\\n| `presorted` | When True, the input file is already sorted on the (`id` `node1` `label` `node2`) columns. | 'False' |\\n| `sort_extras` | Extra parameters for the sort program. The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path |\\n| `use_mgzip` | When True, use the mgzip program where appropriate for faster compression. | 'True' |\\n| `verbose` | When True, produce additional feedback messages. | 'True' |\\n\\nNote: if `pigz` version 2.4 (or later) is not available on your system, use `gzip`.\\n\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 1,\n", " 'metadata': {'tags': ['parameters'],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:18.943621',\n", " 'end_time': '2020-12-24T04:01:18.971367',\n", " 'duration': 0.027746,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:18.968715Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:18.969252Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:18.970542Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:18.971150Z'}},\n", " 'outputs': [],\n", " 'source': \"# Parameters\\nwikidata_input_path = '/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz'\\nwikidata_parts_path = '/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts'\\ntemp_folder_path = wikidata_parts_path + '/temp'\\ngzip_command = 'pigz'\\nkgtk_command = 'time kgtk'\\nkgtk_options = '--debug --timing'\\nkgtk_extension = 
'tsv.gz'\\npresorted = 'False'\\nsort_extras = '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path\\nuse_mgzip = 'True'\\nverbose = 'True'\\n\"},\n", " {'cell_type': 'code',\n", " 'metadata': {'tags': ['injected-parameters'],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:18.989412',\n", " 'end_time': '2020-12-24T04:01:19.014298',\n", " 'duration': 0.024886,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:19.011516Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:19.012024Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:19.013826Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:19.014178Z'}},\n", " 'execution_count': 2,\n", " 'source': '# Parameters\\nwikidata_input_path = (\\n \"/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz\"\\n)\\nwikidata_parts_path = \"/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts\"\\ntemp_folder_path = \"/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\"\\nsort_extras = \"--buffer-size 30% --temporary-directory $OUT/parts/temp\"\\nverbose = False\\n',\n", " 'outputs': []},\n", " {'cell_type': 'code',\n", " 'execution_count': 3,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:19.032647',\n", " 'end_time': '2020-12-24T04:01:19.061326',\n", " 'duration': 0.028679,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:19.057283Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:19.057926Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:19.060554Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:19.061208Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': \"wikidata_input_path = '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz'\\nwikidata_parts_path = '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts'\\ntemp_folder_path 
= '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp'\\ngzip_command = 'pigz'\\nkgtk_command = 'time kgtk'\\nkgtk_options = '--debug --timing'\\nkgtk_extension = 'tsv.gz'\\npresorted = 'False'\\nsort_extras = '--buffer-size 30% --temporary-directory $OUT/parts/temp'\\nuse_mgzip = 'True'\\nverbose = False\\n\"}],\n", " 'source': \"print('wikidata_input_path = %s' % repr(wikidata_input_path))\\nprint('wikidata_parts_path = %s' % repr(wikidata_parts_path))\\nprint('temp_folder_path = %s' % repr(temp_folder_path))\\nprint('gzip_command = %s' % repr(gzip_command))\\nprint('kgtk_command = %s' % repr(kgtk_command))\\nprint('kgtk_options = %s' % repr(kgtk_options))\\nprint('kgtk_extension = %s' % repr(kgtk_extension))\\nprint('presorted = %s' % repr(presorted))\\nprint('sort_extras = %s' % repr(sort_extras))\\nprint('use_mgzip = %s' % repr(use_mgzip))\\nprint('verbose = %s' % repr(verbose))\\n\"},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:19.080737',\n", " 'end_time': '2020-12-24T04:01:19.099938',\n", " 'duration': 0.019201,\n", " 'status': 'completed'}},\n", " 'source': '### Create working folders and empty them'},\n", " {'cell_type': 'code',\n", " 'execution_count': 4,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:19.119819',\n", " 'end_time': '2020-12-24T04:01:19.391809',\n", " 'duration': 0.27199,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:19.143818Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:19.144596Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:19.390949Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:19.391649Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts: File exists\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 
'stdout',\n", " 'text': 'mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp: File exists\\r\\n'}],\n", " 'source': '!mkdir {wikidata_parts_path}\\n!mkdir {temp_folder_path}'},\n", " {'cell_type': 'code',\n", " 'execution_count': 5,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:19.420593',\n", " 'end_time': '2020-12-24T04:01:19.692890',\n", " 'duration': 0.272297,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:19.453075Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:19.453750Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:19.691403Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:19.692716Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'rm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/*.tsv: No such file or directory\\r\\nrm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/*.tsv.gz: No such file or directory\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'rm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/*.tsv: No such file or directory\\r\\nrm: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/*.tsv.gz: No such file or directory\\r\\n'}],\n", " 'source': '!rm {wikidata_parts_path}/*.tsv {wikidata_parts_path}/*.tsv.gz\\n!rm {temp_folder_path}/*.tsv {temp_folder_path}/*.tsv.gz'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:19.724442',\n", " 'end_time': '2020-12-24T04:01:19.750139',\n", " 'duration': 0.025697,\n", " 'status': 'completed'}},\n", " 'source': '### Sort the Input Data Unless Presorted\\nSort the input data file by (id, node1, label, node2).\\nThis may take a while.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 6,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 
'start_time': '2020-12-24T04:01:19.772923',\n", " 'end_time': '2020-12-24T04:01:23.550324',\n", " 'duration': 3.777401,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:19.803414Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:19.804062Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:23.549339Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:23.550119Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': \"Sorting the input file '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz'.\\n\"},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:03.387055 CPU=0:00:00.823541 ( 24.3%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/all.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m3.627s\\r\\nuser\\t0m3.006s\\r\\nsys\\t0m0.326s\\r\\n'}],\n", " 'source': 'if presorted.lower() == \"true\": \\n print(\\'Using a presorted input file %s.\\' % repr(wikidata_input_path))\\n partition_input_file = wikidata_input_path \\nelse: \\n print(\\'Sorting the input file %s.\\' % repr(wikidata_input_path))\\n partition_input_file = wikidata_parts_path + \\'/all.\\' + kgtk_extension \\n !{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {wikidata_input_path} \\\\\\n --output-file {partition_input_file} \\\\\\n --columns id node1 label node2 \\\\\\n --extra \"{sort_extras}\"'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:23.576957',\n", " 
'end_time': '2020-12-24T04:01:23.601527',\n", " 'duration': 0.02457,\n", " 'status': 'completed'}},\n", " 'source': '### Partition the Claims, Qualifiers, and Entity Data\\nSplit out the entity data (alias, description, label, and sitelinks) and additional metadata (datatype, type). Separate the qualifiers from the claims.\\n'},\n", " {'cell_type': 'code',\n", " 'execution_count': 7,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:23.624491',\n", " 'end_time': '2020-12-24T04:01:31.645484',\n", " 'duration': 8.020993,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:23.658652Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:23.677999Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:31.644483Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:31.645314Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:06.827845 CPU=0:00:06.726057 ( 98.5%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/all.tsv.gz -p ; datatype ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.property.datatypes.tsv.gz -p ; alias ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.tsv.gz -p ; description ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.tsv.gz -p ; label ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.tsv.gz -p ; addl_wikipedia_sitelink,wikipedia_sitelink ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.tsv.gz -p ; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz -p ; type ; -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.types.tsv.gz --reject-file 
/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m7.802s\\r\\nuser\\t0m6.653s\\r\\nsys\\t0m0.252s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --first-match-only \\\\\\n --input-file {partition_input_file} \\\\\\n -p '; datatype ;' -o {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension} \\\\\\n -p '; alias ;' -o {wikidata_parts_path}/aliases.{kgtk_extension} \\\\\\n -p '; description ;' -o {wikidata_parts_path}/descriptions.{kgtk_extension} \\\\\\n -p '; label ;' -o {wikidata_parts_path}/labels.{kgtk_extension} \\\\\\n -p '; addl_wikipedia_sitelink,wikipedia_sitelink ;' \\\\\\n -o {wikidata_parts_path}/sitelinks.{kgtk_extension} \\\\\\n -p '; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ;' \\\\\\n -o {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \\\\\\n -p '; type ;' -o {wikidata_parts_path}/metadata.types.{kgtk_extension} \\\\\\n --reject-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension}\"},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:31.675980',\n", " 'end_time': '2020-12-24T04:01:31.699820',\n", " 'duration': 0.02384,\n", " 'status': 'completed'}},\n", " 'source': '### Sort the claims and qualifiers on Node1\\nSort the combined claims and qualifiers file by the node1 column.\\nThis may take a while.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 8,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:31.721352',\n", " 'end_time': '2020-12-24T04:01:33.048996',\n", " 'duration': 1.327644,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:31.746849Z',\n", " 'iopub.execute_input': 
'2020-12-24T04:01:31.747450Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:33.047944Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:33.048824Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:01.061974 CPU=0:00:00.680296 ( 64.1%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --columns node1 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.183s\\r\\nuser\\t0m1.964s\\r\\nsys\\t0m0.176s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension}\\\\\\n --columns node1 \\\\\\n --extra \"{sort_extras}\"'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:33.078983',\n", " 'end_time': '2020-12-24T04:01:33.103943',\n", " 'duration': 0.02496,\n", " 'status': 'completed'}},\n", " 'source': \"### Split the claims and qualifiers\\nIf row A's node1 value matches some other row's id value, then row A is a qualifier.\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 9,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:33.127543',\n", " 'end_time': '2020-12-24T04:01:39.868141',\n", " 'duration': 6.740598,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:33.155629Z',\n", " 'iopub.execute_input': 
'2020-12-24T04:01:33.156229Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:39.867126Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:39.867972Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:06.268601 CPU=0:00:06.180480 ( 98.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --filter-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/qualifiers.sorted-by-node1.tsv.gz --reject-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.sorted-by-node1.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m6.595s\\r\\nuser\\t0m6.092s\\r\\nsys\\t0m0.225s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension} \\\\\\n --filter-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension}\\\\\\n --reject-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension}\\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:39.913151',\n", " 'end_time': '2020-12-24T04:01:39.941215',\n", " 'duration': 0.028064,\n", " 'status': 'completed'}},\n", " 'source': '### Sort the claims by ID\\nSort the split claims by id, node1, label, node2.\\nThis may take a while.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 10,\n", " 'metadata': {'tags': [],\n", " 
'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:39.965000',\n", " 'end_time': '2020-12-24T04:01:41.342312',\n", " 'duration': 1.377312,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:39.997328Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:39.998314Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:41.341422Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:41.342149Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:01.079537 CPU=0:00:00.685110 ( 63.5%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.sorted-by-node1.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.no-datatype.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.226s\\r\\nuser\\t0m1.637s\\r\\nsys\\t0m0.170s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension} \\\\\\n --output-file {temp_folder_path}/claims.no-datatype.{kgtk_extension}\\\\\\n --columns id node1 label node2 \\\\\\n --extra \"{sort_extras}\"'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:41.372564',\n", " 'end_time': '2020-12-24T04:01:41.396939',\n", " 'duration': 0.024375,\n", " 'status': 'completed'}},\n", " 'source': '### Merge the Wikidata Property Datatypes into the claims\\nMerge the Wikidata Property Datatypes into the claims row as node2;wikidatatype. This column will be used to partition the claims by Wikidata Property Datatype in a later step. 
If the claims file already has a node2;wikidatatype column, lift only when that column has an empty value.\\n'},\n", " {'cell_type': 'code',\n", " 'execution_count': 11,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:41.422214',\n", " 'end_time': '2020-12-24T04:01:44.940664',\n", " 'duration': 3.51845,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:41.450977Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:41.451612Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:44.939766Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:44.940503Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:03.010786 CPU=0:00:02.979860 ( 99.0%): lift --verbose=False --use-mgzip=True --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/claims.no-datatype.tsv.gz --columns-to-lift label --overwrite False --label-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/metadata.property.datatypes.tsv.gz --label-value datatype --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tsv.gz --columns-to-write node2;wikidatatype\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m3.375s\\r\\nuser\\t0m2.963s\\r\\nsys\\t0m0.135s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} lift --verbose={verbose} --use-mgzip={use_mgzip} \\\\\\n --input-file {temp_folder_path}/claims.no-datatype.{kgtk_extension} \\\\\\n --columns-to-lift label \\\\\\n --overwrite False \\\\\\n --label-file {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension}\\\\\\n --label-value datatype \\\\\\n --output-file {wikidata_parts_path}/claims.{kgtk_extension}\\\\\\n --columns-to-write 'node2;wikidatatype'\"},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': 
'2020-12-24T04:01:44.973475',\n", " 'end_time': '2020-12-24T04:01:44.999935',\n", " 'duration': 0.02646,\n", " 'status': 'completed'}},\n", " 'source': '### Sort the qualifiers by ID\\nSort the split qualifiers by id, node1, label, node2.\\nThis may take a while.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 12,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:45.024688',\n", " 'end_time': '2020-12-24T04:01:46.277644',\n", " 'duration': 1.252956,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:45.053828Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:45.054512Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:46.276755Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:46.277477Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:00.971581 CPU=0:00:00.670670 ( 69.0%): sort2 --verbose=False --gzip-command=pigz --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/qualifiers.sorted-by-node1.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --columns id node1 label node2 --extra --buffer-size 30% --temporary-directory /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp\\r\\n\\r\\nreal\\t0m1.109s\\r\\nuser\\t0m1.389s\\r\\nsys\\t0m0.159s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \\\\\\n --input-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.{kgtk_extension}\\\\\\n --columns id node1 label node2 \\\\\\n --extra \"{sort_extras}\"'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:46.311060',\n", " 'end_time': '2020-12-24T04:01:46.342233',\n", " 'duration': 0.031173,\n", " 'status': 
'completed'}},\n", " 'source': \"### Extract the English aliases, descriptions, labels, and sitelinks.\\nAliases, descriptions, and labels are extracted by selecting rows where the `node2` value ends in the language suffix for English (`@en`) in a KGTK language-qualified string. This is an abbreviated pattern; a more general pattern would include the single quotes used to delimit a KGTK language-qualified string. If `kgtk import-wikidata` has executed properly, the abbreviated pattern should be sufficient.\\n\\nSitelink rows do not have a language-specific marker in the `node2` value. We use the `lang` column to provide the language code for English ('en'). The `lang` column is an additional column created by `kgtk import-wikidata`.\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 13,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:46.370936',\n", " 'end_time': '2020-12-24T04:01:48.107618',\n", " 'duration': 1.736682,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:46.401568Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:46.402217Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:48.106672Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:48.107445Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:01.442662 CPU=0:00:01.420834 ( 98.5%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/aliases.en.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.591s\\r\\nuser\\t0m1.425s\\r\\nsys\\t0m0.117s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/aliases.{kgtk_extension} \\\\\\n -p ';; 
^.*@en$' -o {wikidata_parts_path}/aliases.en.{kgtk_extension}\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 14,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:48.141826',\n", " 'end_time': '2020-12-24T04:01:49.943003',\n", " 'duration': 1.801177,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:48.178097Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:48.178842Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:49.942122Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:49.942839Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:01.462942 CPU=0:00:01.445114 ( 98.8%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/descriptions.en.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.647s\\r\\nuser\\t0m1.459s\\r\\nsys\\t0m0.126s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/descriptions.{kgtk_extension} \\\\\\n -p ';; ^.*@en$' -o {wikidata_parts_path}/descriptions.en.{kgtk_extension}\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 15,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:49.976653',\n", " 'end_time': '2020-12-24T04:01:51.707139',\n", " 'duration': 1.730486,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:50.009197Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:50.009746Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:51.706240Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:51.706973Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 
'stdout',\n", " 'text': 'Timing: elapsed=0:00:01.414010 CPU=0:00:01.399556 ( 99.0%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.tsv.gz -p ;; ^.*@en$ -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/labels.en.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.578s\\r\\nuser\\t0m1.413s\\r\\nsys\\t0m0.117s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \\\\\\n --input-file {wikidata_parts_path}/labels.{kgtk_extension} \\\\\\n -p ';; ^.*@en$' -o {wikidata_parts_path}/labels.en.{kgtk_extension}\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 16,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:51.741666',\n", " 'end_time': '2020-12-24T04:01:52.912481',\n", " 'duration': 1.170815,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:51.775646Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:51.776326Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:52.911596Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:52.912314Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:00.700093 CPU=0:00:00.693235 ( 99.0%): filter --verbose=False --use-mgzip=True --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz -p ; sitelink-language ; en -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.021s\\r\\nuser\\t0m0.712s\\r\\nsys\\t0m0.099s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} \\\\\\n --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} 
\\\\\\n -p '; sitelink-language ; en' -o {temp_folder_path}/sitelinks.language.en.{kgtk_extension}\"},\n", " {'cell_type': 'code',\n", " 'execution_count': 17,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:52.947779',\n", " 'end_time': '2020-12-24T04:01:54.343297',\n", " 'duration': 1.395518,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:52.981706Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:52.982265Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:54.342572Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:54.343081Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:00.827157 CPU=0:00:00.810847 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.en.tsv.gz --input-keys id --filter-keys node1\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.247s\\r\\nuser\\t0m0.817s\\r\\nsys\\t0m0.111s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/sitelinks.{kgtk_extension} \\\\\\n --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/sitelinks.en.{kgtk_extension} \\\\\\n --input-keys id \\\\\\n --filter-keys node1'},\n", " {'cell_type': 'code',\n", " 'execution_count': 18,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:54.382473',\n", " 'end_time': '2020-12-24T04:01:55.721828',\n", " 'duration': 1.339355,\n", " 'status': 'completed'},\n", " 
'execution': {'iopub.status.busy': '2020-12-24T04:01:54.423482Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:54.424161Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:55.720973Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:55.721666Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Timing: elapsed=0:00:00.733432 CPU=0:00:00.720798 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/sitelinks.qualifiers.en.tsv.gz --input-keys node1 --filter-keys node1\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.180s\\r\\nuser\\t0m0.747s\\r\\nsys\\t0m0.113s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \\\\\\n --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/sitelinks.qualifiers.en.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys node1'},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:55.757919',\n", " 'end_time': '2020-12-24T04:01:55.788849',\n", " 'duration': 0.03093,\n", " 'status': 'completed'}},\n", " 'source': '### Partition the claims by Wikidata Property Datatype\\nWikidata has two names for each Wikidata property datatype: the name that appears in the JSON dump file, and the name that appears in the TTL dump file. 
`kgtk import-wikidata` currently imports rows from Wikidata JSON dump files, and these are the names that appear below.\\n\\nThe `part.other` file catches any records that have an unknown Wikidata property datatype. Additional Wikidata property datatypes may occur when processing from certain Wikidata extensions.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 19,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:55.817159',\n", " 'end_time': '2020-12-24T04:01:56.912792',\n", " 'duration': 1.095633,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:55.852920Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:55.853478Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:56.911876Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:56.912620Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Error: Cannot find the object column \'node2;wikidatatype\'.\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/filter.py\", line 1169, in run\\r\\n return process_plain()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/filter.py\", line 682, in process_plain\\r\\n raise KGTKException(\"Missing columns.\")\\r\\nkgtk.exceptions.KGTKException: Missing columns.\\r\\nMissing columns.\\r\\nTiming: elapsed=0:00:00.797013 CPU=0:00:00.776128 ( 97.4%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tsv.gz --obj node2;wikidatatype -p ;; commonsMedia -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz -p ;; external-id -o 
/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz -p ;; geo-shape -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz -p ;; globe-coordinate -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz -p ;; math -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz -p ;; monolingualtext -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz -p ;; musical-notation -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz -p ;; quantity -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz -p ;; string -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz -p ;; tabular-data -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz -p ;; time -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz -p ;; url -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz -p ;; wikibase-form -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz -p ;; wikibase-item -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz -p ;; wikibase-lexeme -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz -p ;; wikibase-property -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz -p ;; wikibase-sense -o /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz --reject-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.other.tsv.gz\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.943s\\r\\nuser\\t0m0.787s\\r\\nsys\\t0m0.112s\\r\\n'}],\n", " 'source': \"!{kgtk_command} {kgtk_options} filter --verbose={verbose} 
--use-mgzip={use_mgzip} --first-match-only \\\\\\n --input-file {wikidata_parts_path}/claims.{kgtk_extension} \\\\\\n --obj 'node2;wikidatatype' \\\\\\n -p ';; commonsMedia' -o {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \\\\\\n -p ';; external-id' -o {wikidata_parts_path}/claims.external-id.{kgtk_extension} \\\\\\n -p ';; geo-shape' -o {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \\\\\\n -p ';; globe-coordinate' -o {wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \\\\\\n -p ';; math' -o {wikidata_parts_path}/claims.math.{kgtk_extension} \\\\\\n -p ';; monolingualtext' -o {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} \\\\\\n -p ';; musical-notation' -o {wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \\\\\\n -p ';; quantity' -o {wikidata_parts_path}/claims.quantity.{kgtk_extension} \\\\\\n -p ';; string' -o {wikidata_parts_path}/claims.string.{kgtk_extension} \\\\\\n -p ';; tabular-data' -o {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \\\\\\n -p ';; time' -o {wikidata_parts_path}/claims.time.{kgtk_extension} \\\\\\n -p ';; url' -o {wikidata_parts_path}/claims.url.{kgtk_extension} \\\\\\n -p ';; wikibase-form' -o {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \\\\\\n -p ';; wikibase-item' -o {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \\\\\\n -p ';; wikibase-lexeme' -o {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \\\\\\n -p ';; wikibase-property' -o {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \\\\\\n -p ';; wikibase-sense' -o {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \\\\\\n --reject-file {wikidata_parts_path}/claims.other.{kgtk_extension}\"},\n", " {'cell_type': 'markdown',\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:56.949949',\n", " 'end_time': '2020-12-24T04:01:56.979812',\n", " 'duration': 0.029863,\n", " 'status': 
'completed'}},\n", " 'source': '### Partition the qualifiers\\nExtract the qualifier records for each of the Wikidata property datatype partition files.'},\n", " {'cell_type': 'code',\n", " 'execution_count': 20,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:57.007807',\n", " 'end_time': '2020-12-24T04:01:58.096484',\n", " 'duration': 1.088677,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:57.039775Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:57.040379Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:58.095586Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:58.096316Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.783377 CPU=0:00:00.703171 ( 89.8%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.commonsMedia.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.commonsMedia.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.940s\\r\\nuser\\t0m0.721s\\r\\nsys\\t0m0.106s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \\\\\\n --output-file 
{wikidata_parts_path}/qualifiers.commonsMedia.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 21,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:58.126907',\n", " 'end_time': '2020-12-24T04:01:59.196539',\n", " 'duration': 1.069632,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:58.160443Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:58.160980Z',\n", " 'iopub.status.idle': '2020-12-24T04:01:59.195572Z',\n", " 'shell.execute_reply': '2020-12-24T04:01:59.196302Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.700827 CPU=0:00:00.686953 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.external-id.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.external-id.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.918s\\r\\nuser\\t0m0.703s\\r\\nsys\\t0m0.099s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.external-id.{kgtk_extension} \\\\\\n --output-file 
{wikidata_parts_path}/qualifiers.external-id.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 22,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:01:59.226546',\n", " 'end_time': '2020-12-24T04:02:00.286191',\n", " 'duration': 1.059645,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:01:59.261423Z',\n", " 'iopub.execute_input': '2020-12-24T04:01:59.262067Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:00.285460Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:00.286054Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.693817 CPU=0:00:00.680174 ( 98.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.geo-shape.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.geo-shape.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.910s\\r\\nuser\\t0m0.693s\\r\\nsys\\t0m0.102s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \\\\\\n --output-file 
{wikidata_parts_path}/qualifiers.geo-shape.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 23,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:00.317067',\n", " 'end_time': '2020-12-24T04:02:01.376710',\n", " 'duration': 1.059643,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:00.353914Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:00.354444Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:01.375736Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:01.376541Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.685847 CPU=0:00:00.674586 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.globe-coordinate.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.globe-coordinate.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.908s\\r\\nuser\\t0m0.695s\\r\\nsys\\t0m0.100s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on 
{wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.globe-coordinate.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 24,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:01.417421',\n", " 'end_time': '2020-12-24T04:02:02.487716',\n", " 'duration': 1.070295,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:01.454499Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:01.455052Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:02.486818Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:02.487549Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, 
compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.692024 CPU=0:00:00.686177 ( 99.2%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.math.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.math.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.915s\\r\\nuser\\t0m0.710s\\r\\nsys\\t0m0.098s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.math.{kgtk_extension} \\\\\\n 
--output-file {wikidata_parts_path}/qualifiers.math.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 25,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:02.527005',\n", " 'end_time': '2020-12-24T04:02:03.618049',\n", " 'duration': 1.091044,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:02.568577Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:02.569635Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:03.617155Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:03.617884Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.683551 CPU=0:00:00.673260 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.monolingualtext.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.monolingualtext.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.930s\\r\\nuser\\t0m0.713s\\r\\nsys\\t0m0.106s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} 
\\\\\\n --output-file {wikidata_parts_path}/qualifiers.monolingualtext.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 26,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:03.658843',\n", " 'end_time': '2020-12-24T04:02:04.744693',\n", " 'duration': 1.08585,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:03.696451Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:03.697009Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:04.743613Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:04.744387Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.713679 CPU=0:00:00.704594 ( 98.7%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.musical-notation.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.musical-notation.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.930s\\r\\nuser\\t0m0.720s\\r\\nsys\\t0m0.102s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on 
{wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.musical-notation.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 27,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:04.786691',\n", " 'end_time': '2020-12-24T04:02:05.968529',\n", " 'duration': 1.181838,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:04.830352Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:04.830926Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:05.967605Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:05.968364Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, 
compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.779701 CPU=0:00:00.767860 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.quantity.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.quantity.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.020s\\r\\nuser\\t0m0.794s\\r\\nsys\\t0m0.112s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on 
{wikidata_parts_path}/claims.quantity.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.quantity.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 28,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:06.012197',\n", " 'end_time': '2020-12-24T04:02:07.098719',\n", " 'duration': 1.086522,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:06.051402Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:06.051944Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:07.097812Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:07.098546Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, 
thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.697405 CPU=0:00:00.686692 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.string.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.string.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.929s\\r\\nuser\\t0m0.713s\\r\\nsys\\t0m0.101s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.string.{kgtk_extension} \\\\\\n 
--output-file {wikidata_parts_path}/qualifiers.string.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 29,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:07.142287',\n", " 'end_time': '2020-12-24T04:02:08.230149',\n", " 'duration': 1.087862,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:07.180935Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:07.181576Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:08.229186Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:08.229986Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.710830 CPU=0:00:00.698269 ( 98.2%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.tabular-data.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tabular-data.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.932s\\r\\nuser\\t0m0.719s\\r\\nsys\\t0m0.099s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \\\\\\n --output-file 
{wikidata_parts_path}/qualifiers.tabular-data.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 30,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:08.273591',\n", " 'end_time': '2020-12-24T04:02:09.458369',\n", " 'duration': 1.184778,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:08.320840Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:08.321416Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:09.457121Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:09.458110Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.778076 CPU=0:00:00.765454 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.time.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.time.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.019s\\r\\nuser\\t0m0.792s\\r\\nsys\\t0m0.114s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.time.{kgtk_extension} \\\\\\n --output-file 
{wikidata_parts_path}/qualifiers.time.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 31,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:09.509146',\n", " 'end_time': '2020-12-24T04:02:10.643175',\n", " 'duration': 1.134029,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:09.568010Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:09.569190Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:10.642292Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:10.643010Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File 
\"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.699523 CPU=0:00:00.688743 ( 98.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.url.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.url.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.956s\\r\\nuser\\t0m0.726s\\r\\nsys\\t0m0.115s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.url.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.url.{kgtk_extension} 
\\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 32,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:10.687437',\n", " 'end_time': '2020-12-24T04:02:11.759828',\n", " 'duration': 1.072391,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:10.728108Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:10.728664Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:11.758859Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:11.759666Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", 
line 143, in __init__\\r\\n fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.690773 CPU=0:00:00.679523 ( 98.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-form.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-form.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.916s\\r\\nuser\\t0m0.703s\\r\\nsys\\t0m0.099s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-form.{kgtk_extension} \\\\\\n --input-keys node1 
\\\\\\n --filter-keys id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 33,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:11.804768',\n", " 'end_time': '2020-12-24T04:02:12.920111',\n", " 'duration': 1.115343,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:11.846434Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:11.847239Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:12.919170Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:12.919936Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n 
fileobj = self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.726953 CPU=0:00:00.713173 ( 98.1%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-item.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.956s\\r\\nuser\\t0m0.734s\\r\\nsys\\t0m0.105s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-item.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", 
" {'cell_type': 'code',\n", " 'execution_count': 34,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:12.997290',\n", " 'end_time': '2020-12-24T04:02:14.332264',\n", " 'duration': 1.334974,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:13.075096Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:13.075767Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:14.331204Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:14.332022Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = 
builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.837385 CPU=0:00:00.819865 ( 97.9%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-lexeme.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-lexeme.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m1.136s\\r\\nuser\\t0m0.864s\\r\\nsys\\t0m0.136s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-lexeme.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'},\n", " 
{'cell_type': 'code',\n", " 'execution_count': 35,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:14.384451',\n", " 'end_time': '2020-12-24T04:02:15.564758',\n", " 'duration': 1.180307,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:14.447325Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:14.448319Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:15.564006Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:15.564619Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = self.myfileobj = 
builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.738577 CPU=0:00:00.726217 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-property.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-property.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.995s\\r\\nuser\\t0m0.760s\\r\\nsys\\t0m0.116s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-property.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys 
id'},\n", " {'cell_type': 'code',\n", " 'execution_count': 36,\n", " 'metadata': {'tags': [],\n", " 'papermill': {'exception': False,\n", " 'start_time': '2020-12-24T04:02:15.607523',\n", " 'end_time': '2020-12-24T04:02:16.682153',\n", " 'duration': 1.07463,\n", " 'status': 'completed'},\n", " 'execution': {'iopub.status.busy': '2020-12-24T04:02:15.650863Z',\n", " 'iopub.execute_input': '2020-12-24T04:02:15.651397Z',\n", " 'iopub.status.idle': '2020-12-24T04:02:16.681159Z',\n", " 'shell.execute_reply': '2020-12-24T04:02:16.681914Z'}},\n", " 'outputs': [{'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': 'Traceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 257, in run\\r\\n ie.process()\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/iff/kgtkifexists.py\", line 774, in process\\r\\n very_verbose=self.very_verbose,\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 534, in open\\r\\n source: ClosableIter[str] = cls._openfile(file_path, options=options, error_file=error_file, verbose=verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 749, in _openfile\\r\\n verbose)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/io/kgtkreader.py\", line 668, in _open_compressed_file\\r\\n return mgzip.open(str(file_or_path), mode=\"rt\", thread=mgzip_threads) # type: ignore\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 55, in open\\r\\n binary_file = MultiGzipFile(filename, gz_mode, compresslevel, thread=thread, blocksize=blocksize)\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/mgzip/multiProcGzip.py\", line 143, in __init__\\r\\n fileobj = 
self.myfileobj = builtins.open(filename, mode or \\'rb\\', blocksize)\\r\\nFileNotFoundError: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\n\\r\\nDuring handling of the above exception, another exception occurred:\\r\\n\\r\\nTraceback (most recent call last):\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/exceptions.py\", line 46, in __call__\\r\\n return_code = func(*args, **kwargs) or 0\\r\\n File \"/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk/cli/ifexists.py\", line 264, in run\\r\\n raise KGTKException(str(e))\\r\\nkgtk.exceptions.KGTKException: [Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\n[Errno 2] No such file or directory: \\'/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz\\'\\r\\nTiming: elapsed=0:00:00.689980 CPU=0:00:00.678144 ( 98.3%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.tsv.gz --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-sense.tsv.gz --output-file /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/qualifiers.wikibase-sense.tsv.gz --input-keys node1 --filter-keys id\\r\\n'},\n", " {'output_type': 'stream',\n", " 'name': 'stdout',\n", " 'text': '\\r\\nreal\\t0m0.912s\\r\\nuser\\t0m0.700s\\r\\nsys\\t0m0.098s\\r\\n'}],\n", " 'source': '!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \\\\\\n --input-file {wikidata_parts_path}/qualifiers.{kgtk_extension} \\\\\\n --filter-on {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \\\\\\n --output-file {wikidata_parts_path}/qualifiers.wikibase-sense.{kgtk_extension} \\\\\\n --input-keys node1 \\\\\\n --filter-keys id'}],\n", " 
'metadata': {'kernelspec': {'display_name': 'Python 3',\n", " 'language': 'python',\n", " 'name': 'python3'},\n", " 'language_info': {'name': 'python',\n", " 'version': '3.7.9',\n", " 'mimetype': 'text/x-python',\n", " 'codemirror_mode': {'name': 'ipython', 'version': 3},\n", " 'pygments_lexer': 'ipython3',\n", " 'nbconvert_exporter': 'python',\n", " 'file_extension': '.py'},\n", " 'papermill': {'default_parameters': {},\n", " 'parameters': {'wikidata_input_path': '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/all_and_qualifiers.tsv.gz',\n", " 'wikidata_parts_path': '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts',\n", " 'temp_folder_path': '/Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/temp',\n", " 'sort_extras': '--buffer-size 30% --temporary-directory $OUT/parts/temp',\n", " 'verbose': False},\n", " 'environment_variables': {},\n", " 'version': '2.2.2',\n", " 'input_path': '/Users/pedroszekely/Documents/GitHub/kgtk/examples/partition-wikidata.ipynb',\n", " 'output_path': '/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/partition-wikidata.out.ipynb',\n", " 'start_time': '2020-12-24T04:01:12.363465',\n", " 'end_time': '2020-12-24T04:02:16.945647',\n", " 'duration': 64.582182,\n", " 'exception': None}},\n", " 'nbformat': 4,\n", " 'nbformat_minor': 4}" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pm.execute_notebook(\n", " os.environ[\"EXAMPLES_DIR\"] + \"/partition-wikidata.ipynb\",\n", " os.environ[\"TEMP\"] + \"/partition-wikidata.out.ipynb\",\n", " parameters=dict(\n", " wikidata_input_path = os.environ[\"TEMP\"] + \"/all_and_qualifiers.tsv.gz\",\n", " wikidata_parts_path = os.environ[\"OUT\"] + \"/parts\",\n", " temp_folder_path = os.environ[\"OUT\"] + \"/parts/temp\",\n", " sort_extras = \"--buffer-size 30% --temporary-directory $OUT/parts/temp\",\n", " verbose = False\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The partition-wikidata 
notebook created the following partitioned kgtk-files:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aliases.en.tsv.gz metadata.property.datatypes.tsv.gz\n", "aliases.tsv.gz metadata.types.tsv.gz\n", "all.tsv.gz qualifiers.tsv.gz\n", "claims.tsv.gz sitelinks.en.tsv.gz\n", "descriptions.en.tsv.gz sitelinks.qualifiers.en.tsv.gz\n", "descriptions.tsv.gz sitelinks.qualifiers.tsv.gz\n", "labels.en.tsv.gz sitelinks.tsv.gz\n", "labels.tsv.gz \u001b[34mtemp\u001b[m\u001b[m\n" ] } ], "source": [ "!ls $OUT/parts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count(DISTINCT graph_36_c1.\"node1\")\n", "13153\n", " 2.61 real 2.55 user 0.37 sys\n" ] } ], "source": [ "!$kypher -i $OUT/parts/claims.tsv.gz \\\n", "--match '(n1)-[]->()' \\\n", "--return 'count(distinct n1)'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Graph Embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normally, we would use `Q154ITEM`, but the partitioning failed, so we will compute it using kypher." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz: No such file or directory\n" ] } ], "source": [ "!zcat < \"$Q154ITEM\" | head" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", "output_type":
"stream", "text": [ " 197521 811687 10791930\n" ] } ], "source": [ "!zcat < \"$TEMP\"/Q154.edges.3.tsv.gz | wc" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.83 real 0.66 user 0.15 sys\n" ] } ], "source": [ "!$kypher -i \"$TEMP\"/Q154.edges.3.tsv.gz -i \"$TEMP\"/Q154.metadata.property.datatype.tsv.gz -i \"$Q154LABEL\" \\\n", "--match 'edges: (n1)-[l {label: property}]->(n2), datatype: (property)-[]->(dt:`wikibase-item`), label: (n1)-[]->(lab)' \\\n", "--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$GE\"/geinput.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have over 60,000 lines:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 66490 265960 3297462 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n" ] } ], "source": [ "!wc \"$GE\"/geinput.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the graph embeddings using the default settings. 
Our output file `embeddings.txt` will be in word2vec format so we can use it directly in gensim." ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In Processing, Please go to /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/ge.log to check details\n", "Opening the input file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n", "KgtkReader: File_path.suffix: .tsv\n", "KgtkReader: reading file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv\n", "header: id\tnode1\tlabel\tnode2\n", "node1 column found, this is a KGTK edge file\n", "KgtkReader: Special columns: node1=1 label=2 node2=3 id=0\n", "KgtkReader: Reading an edge file.\n", "Opening the output file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv\n", "File_path.suffix: .tsv\n", "KgtkWriter: writing file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv\n", "header: id\tnode1\tlabel\tnode2\n", "Processing the input records.\n", "Processed 66489 records.\n", "Processed Finished.\n", " 193.64 real 958.24 user 107.56 sys\n" ] } ], "source": [ "!$kgtk graph-embeddings --verbose -i \"$GE\"/geinput.tsv \\\n", "-o \"$GE\"/embeddings.txt \\\n", "--retain_temporary_data True \\\n", "--operator translation \\\n", "--workers 5 \\\n", "--log \"$GE\"/ge.log \\\n", "-T \"$GE\" \\\n", "-ot w2v \\\n", "-e 300" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the output directory" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 446864\n", "-rw-r--r-- 1 pedroszekely staff 101K Dec 26 16:09 Q27.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 44K Dec 25 22:18 Q27.tsv\n", "-rw-r--r-- 1 pedroszekely staff 177K Dec 26 16:09 Q29.Q45.Q142.sim.tsv\n", "-rw-r--r--
1 pedroszekely staff 43K Dec 25 22:36 Q29.Q45.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 85K Dec 26 16:09 Q29.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 79K Dec 26 16:09 Q332378.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 88K Dec 26 16:09 Q374.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 87K Dec 26 16:09 Q502268.sim.tsv\n", "-rw-r--r-- 1 pedroszekely staff 44K Dec 25 22:11 Q502268.tsv\n", "-rw-r--r-- 1 pedroszekely staff 4.3K Dec 25 21:33 Q610672.tsv\n", "-rw-r--r-- 1 pedroszekely staff 53M Dec 23 23:23 embeddings.txt\n", "-rw-r--r-- 1 pedroszekely staff 480K Dec 23 23:23 ge.log\n", "-rw-r--r-- 1 pedroszekely staff 3.1M Dec 23 22:02 geinput.tsv\n", "-rw-r--r-- 1 pedroszekely staff 973K Dec 23 12:41 geinput.tsv.gz\n", "drwxr-xr-x 10 pedroszekely staff 320B Dec 23 23:23 \u001b[34moutput\u001b[m\u001b[m\n", "-rw-r--r-- 1 pedroszekely staff 19K Dec 26 22:14 projector.qnodes.tsv\n", "-rw-r--r-- 1 pedroszekely staff 2.7M Dec 26 22:14 projector.vectors.tsv\n", "-rw-r--r--@ 1 pedroszekely staff 4.9K Dec 23 15:21 test.txt\n", "-rw-r--r-- 1 pedroszekely staff 1.2M Dec 23 23:20 tmp_geinput.tsv\n", "-rw-r--r-- 1 pedroszekely staff 11K Dec 23 16:22 translation.10.tsv\n", "-rw-r--r-- 1 pedroszekely staff 8.2K Dec 23 21:50 translation.1000.projector.metadata.1.tsv\n", "-rw-r--r-- 1 pedroszekely staff 29K Dec 23 23:00 translation.1000.projector.metadata.tsv\n", "-rw-r--r-- 1 pedroszekely staff 1.2M Dec 23 21:50 translation.1000.projector.vectors.tsv\n", "-rw-r--r-- 1 pedroszekely staff 1.2M Dec 23 21:50 translation.1000.tsv\n", "-rw-r--r-- 1 pedroszekely staff 1.2M Dec 23 20:59 translation.1000.txt\n", "-rw-r--r-- 1 pedroszekely staff 622K Dec 23 23:34 translation.10000.projector.metadata.tsv\n", "-rw-r--r-- 1 pedroszekely staff 12M Dec 23 23:23 translation.10000.projector.vectors.tsv\n", "-rw-r--r-- 1 pedroszekely staff 12M Dec 23 23:23 translation.10000.tsv\n", "-rw-r--r-- 1 pedroszekely staff 143K Dec 23 23:07 translation.5000.projector.metadata.tsv\n", "-rw-r--r-- 1 
pedroszekely staff 6.0M Dec 23 23:07 translation.5000.projector.vectors.tsv\n", "-rw-r--r-- 1 pedroszekely staff 6.0M Dec 23 23:07 translation.5000.tsv\n", "-rw-r--r-- 1 pedroszekely staff 114K Dec 26 22:10 translation.projector.metadata.tsv\n", "-rw-r--r-- 1 pedroszekely staff 83K Dec 26 21:28 translation.projector.qnodes.lab.des.tsv\n", "-rw-r--r-- 1 pedroszekely staff 19K Dec 26 22:10 translation.projector.qnodes.tsv\n", "-rw-r--r-- 1 pedroszekely staff 2.7M Dec 26 22:10 translation.projector.vectors.tsv\n", "-rw-r--r-- 1 pedroszekely staff 54M Dec 23 20:25 translation.tsv\n", "-rw-r--r-- 1 pedroszekely staff 54M Dec 23 15:23 translation2.txt\n", "-rw-r--r-- 1 pedroszekely staff 7.9K Dec 23 21:58 xxx.txt\n" ] } ], "source": [ "!ls -hl \"$GE\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's peek at the file, we have 44K vectors of dimension 100" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44419 100\n", "Q243611 -0.331411451 -0.152568206 -0.139386058 -0.121394955 -0.334799886 0.023394363 -0.024942441 -0.137579590 0.084599547 0.876167953 -0.222018719 -0.168754980 -0.027932534 -0.289450347 0.250572681 -0.633476973 -0.440892249 -0.178823337 0.299026161 -0.407618254 -0.036977571 0.032356881 -0.081695572 -0.055025205 -0.182957411 -0.250380307 0.535348237 -0.108279251 0.452128828 -0.346319675 0.042611640 0.338040203 0.171208084 -0.275558919 0.114576176 -0.198427215 -0.277292132 -0.149741501 -0.327517658 0.146066576 0.431715995 0.481242269 -0.124767415 -0.171481445 -0.394009471 -0.305026233 0.223357961 0.360154629 0.213194653 0.012373813 -0.405227572 0.052000813 0.084122777 0.072465442 0.241527051 0.314641565 -0.258469820 0.122197300 -0.385967076 -0.472052187 -0.090907939 -0.102187648 0.184509873 0.132856295 0.402841479 0.585462868 0.695401728 0.060416430 -0.322626084 -0.238338873 0.333650321 0.479767382 -0.366145641 0.051905960 0.275238752 0.429640323 
-0.370602965 0.055560533 0.609016299 -0.264090836 0.130152687 -0.186686888 0.346337169 -0.695047677 -0.011451115 -0.673357785 -0.533024371 0.064912595 0.069889240 -0.252222359 -0.089250244 -0.509508848 0.427851468 0.018754318 -0.192092314 -0.222673357 -0.156975567 -0.142941862 0.170732170 0.495883286\n" ] } ], "source": [ "!head -2 \"$GE\"/embeddings.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the vectors in gensim" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "path = os.environ['GE'] + \"/embeddings.txt\"\n", "ge_vectors = KeyedVectors.load_word2vec_format(path, binary=False)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.71844614, -0.72041976, 0.819834 , -0.07249352, 0.24403723,\n", " 0.60705996, -0.5666862 , -0.5559557 , 0.686424 , 0.6667965 ,\n", " -0.46009716, 0.4207767 , -0.17946522, -0.18458156, -1.0764353 ,\n", " 1.056981 , -0.06046142, 0.00866301, -0.02163753, -0.3418129 ,\n", " -0.03871485, -0.14953642, 0.8018838 , 0.19381396, -0.10066328,\n", " 0.884025 , -0.08962934, -0.36985362, -0.3394345 , 0.671762 ,\n", " 0.11509704, -0.6489555 , -0.22910565, -0.6392556 , 0.8204702 ,\n", " -0.260422 , 0.4548083 , 0.06683284, -0.09605702, 0.23433112,\n", " 0.4129733 , 0.05630195, -0.24607319, -0.19756897, 0.3878965 ,\n", " 0.08242382, 0.07034106, 0.14290804, 0.07523334, -0.16040339,\n", " 0.02874546, -0.0554648 , 0.00764391, -0.6856189 , -0.3701922 ,\n", " -0.23979117, 0.26580626, 0.01087183, -1.2511953 , 0.01297893,\n", " -0.23593499, -0.16515297, -0.2442124 , -0.10745924, 1.16383 ,\n", " -0.8887456 , 0.7308084 , -0.02755331, 1.395485 , -0.34370282,\n", " 0.61988074, 0.28472528, -0.51778364, -0.5608775 , 0.6496688 ,\n", " -0.11930947, -0.4032322 , 1.1153812 , -0.9912186 , 0.09023302,\n", " -0.3542225 , 0.24804258, 0.26503336, -0.6374534 , 0.13950008,\n", " -0.47777557, 0.77702343, 0.0645401 , -0.16665687,
-0.37595555,\n", " 0.70249134, -0.77693635, 0.2853018 , 0.35154393, -0.03257728,\n", " -1.2317531 , -0.41577864, -0.73989207, 1.072565 , -0.0718146 ],\n", " dtype=float32)" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q502268 is Johnnie Walker\n", "ge_vectors['Q502268']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the most similar qnodes to `Q15874936`, the qnode for Michelob." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Q610672', 0.9267997741699219),\n", " ('Q48799234', 0.7637178897857666),\n", " ('Q85269976', 0.762772262096405),\n", " ('Q5647008', 0.7582801580429077),\n", " ('Q5149389', 0.7565429210662842)]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ge_vectors.most_similar(positive=['Q15874936'], topn=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is hard to use because the results are qnodes and we have no idea what they are. Let's define a function to fetch the labels and descriptions so that we can interpret the results more easily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`kgtk_most_similar` is a wrapper around gensim's `most_similar` function, designed to output the results in KGTK format. The `kg_path` argument is required if we want to output the labels and descriptions, as this path is where the `labels.en.tsv.gz` and `descriptions.en.tsv.gz` files are stored. You can optionally provide an `output_path` to store the results in a file; otherwise the results are returned as a dataframe." 
] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [
 "def kgtk_most_similar(\n",
 "    vectors,\n",
 "    positive,\n",
 "    relation_label=\"similarity_score\",\n",
 "    kg_path=None,\n",
 "    add_label_description=True,\n",
 "    output_path=None,\n",
 "    topn=25,\n",
 "):\n",
 "    \"\"\"Find the `topn` qnodes most similar to `positive` and report them as KGTK edges.\n",
 "\n",
 "    If `kg_path` is given, labels and descriptions are added to each row.\n",
 "    Results are written to `output_path` if given; otherwise a dataframe is returned.\n",
 "    \"\"\"\n",
 "    result = []\n",
 "    if add_label_description and kg_path:\n",
 "        # Write the similarity edges to a temporary KGTK file so kypher can join them\n",
 "        fp = tempfile.NamedTemporaryFile(\n",
 "            mode=\"w\", suffix=\".tsv\", delete=False, encoding=\"utf-8\"\n",
 "        )\n",
 "        fp.write(\"node1\\tlabel\\tnode2\\n\")\n",
 "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
 "            fp.write(\"{}\\t{}\\t{}\\n\".format(qnode, relation_label, similarity))\n",
 "        filename = fp.name\n",
 "        fp.close()\n",
 "\n",
 "        os.environ[\"_label_graph\"] = kg_path + \"/labels.en.tsv.gz\"\n",
 "        os.environ[\"_description_graph\"] = kg_path + \"/descriptions.en.tsv.gz\"\n",
 "        os.environ[\"_temp_file\"] = filename\n",
 "\n",
 "        result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_temp_file\" --as sim \\\n",
 "--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n",
 "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, lab as `node1;label`, des as `node1;description`' \\\n",
 "--order-by 'cast(similarity, float) desc' \n",
 "\n",
 "        os.remove(filename)\n",
 "\n",
 "    else:\n",
 "        # No trailing newlines here, so these lines match the kypher output lines\n",
 "        result.append(\"node1\\tlabel\\tnode2\")\n",
 "        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):\n",
 "            result.append(\"{}\\t{}\\t{}\".format(qnode, relation_label, similarity))\n",
 "\n",
 "    if output_path:\n",
 "        with open(output_path, \"w\") as handle:\n",
 "            for line in result:\n",
 "                handle.write(line)\n",
 "                handle.write(\"\\n\")\n",
 "    else:\n",
 "        columns = result[0].split(\"\\t\")\n",
 "        data = []\n",
 "        for line in result[1:]:\n",
 "            data.append(line.split(\"\\t\"))\n",
 "        return pd.DataFrame(data, columns=columns)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Let's give it a try:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
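<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\">\n",
 "      <th></th>\n",
 "      <th>node1</th>\n",
 "      <th>node2</th>\n",
 "      <th>label</th>\n",
 "      <th>node1;label</th>\n",
 "      <th>node1;description</th>\n",
 "    </tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr>\n",
 "      <th>0</th>\n",
 "      <td>Q610672</td>\n",
 "      <td>0.9267997741699219</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Budweiser'@en</td>\n",
 "      <td>'brand of pale lager'@en</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>1</th>\n",
 "      <td>Q48799234</td>\n",
 "      <td>0.7637178897857666</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Virginia Black Whiskey'@en</td>\n",
 "      <td>'super-premium brand of American Bourbon whisk...</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>2</th>\n",
 "      <td>Q85269976</td>\n",
 "      <td>0.762772262096405</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Busch Beer'@en</td>\n",
 "      <td>'brand of beer owned by Anheuser-Busch'@en</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>3</th>\n",
 "      <td>Q5149389</td>\n",
 "      <td>0.7565429210662842</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Colt 45'@en</td>\n",
 "      <td>'malt liquor'@en</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>4</th>\n",
 "      <td>Q3079990</td>\n",
 "      <td>0.752647340297699</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Four Loko'@en</td>\n",
 "      <td>'Drink'@en</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>5</th>\n",
 "      <td>Q96952363</td>\n",
 "      <td>0.7438719272613525</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Cronk'@en</td>\n",
 "      <td>'American drink'@en</td>\n",
 "    </tr>\n",
 "    <tr>\n",
 "      <th>6</th>\n",
 "      <td>Q7085533</td>\n",
 "      <td>0.7436875104904175</td>\n",
 "      <td>similarity</td>\n",
 "      <td>'Olde English 800'@en</td>\n",
 "      <td>'malt liquor'@en</td>\n",
 "    </tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>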
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q610672 0.9267997741699219 similarity 'Budweiser'@en \n", "1 Q48799234 0.7637178897857666 similarity 'Virginia Black Whiskey'@en \n", "2 Q85269976 0.762772262096405 similarity 'Busch Beer'@en \n", "3 Q5149389 0.7565429210662842 similarity 'Colt 45'@en \n", "4 Q3079990 0.752647340297699 similarity 'Four Loko'@en \n", "5 Q96952363 0.7438719272613525 similarity 'Cronk'@en \n", "6 Q7085533 0.7436875104904175 similarity 'Olde English 800'@en \n", "\n", " node1;description \n", "0 'brand of pale lager'@en \n", "1 'super-premium brand of American Bourbon whisk... \n", "2 'brand of beer owned by Anheuser-Busch'@en \n", "3 'malt liquor'@en \n", "4 'Drink'@en \n", "5 'American drink'@en \n", "6 'malt liquor'@en " ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q15874936 is Michelob\n", "kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text embeddings" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $OUT/all.tsv.gz | head -500 > $TEMP/all.500.tsv" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\n", "P10-P1628-32b85d-7927ece6-0\tP10\tP1628\t\"http://www.w3.org/2006/vcard/ns#Video\"\n", "P10-P1628-acf60d-b8950832-0\tP10\tP1628\t\"https://schema.org/video\"\n", 
"P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n", "P10-P1659-P1651-c4068028-0\tP10\tP1659\tP1651\n", "P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n", "P10-P1659-P4238-d21d1ac0-0\tP10\tP1659\tP4238\n", "P10-P1659-P51-86aca4c5-0\tP10\tP1659\tP51\n", "P10-P1855-Q15075950-7eff6d65-0\tP10\tP1855\tQ15075950\n", "P10-P1855-Q69063653-c8cdb04c-0\tP10\tP1855\tQ69063653\n" ] } ], "source": [ "!head $TEMP/all.500.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explain the command here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!$kgtk text-embedding -i $OUT/all.tsv.gz \\\n", "--embedding-projector-metadata-path none \\\n", "--label-properties label \\\n", "--isa-properties P31 P279 P452 P106 \\\n", "--description-properties description \\\n", "--property-value P186 P17 P127 P176 P169 \\\n", "--has-properties \"\" \\\n", "-f kgtk_format \\\n", "--output-data-format kgtk_format \\\n", "--save-embedding-sentence \\\n", "--model bert-large-nli-cls-token \\\n", "-o \"$TE\" \\\n", "> \"$TE\"/text-embedding.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Duration --parallel 1\n", "16348.11 real 16066.21 user 315.45 sys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text embeddings are output in KGTK format and we need them in word2vec format (need to enhance the command to produce w2v format). For now, define a function to convert the KGTK embeddings to w2v format." 
] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [
 "def convert_kgtk_to_w2v(input_path, output_path, text_embedding_label=\"text_embedding\"):\n",
 "    \"\"\"\n",
 "    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format\n",
 "    \"\"\"\n",
 "    vector_count = 0\n",
 "    vector_length = 0\n",
 "\n",
 "    # First pass: count the vectors and get their length, as the w2v header needs both\n",
 "    with open(input_path, \"r\") as kgtk_file:\n",
 "        next(kgtk_file)  # skip the KGTK header line\n",
 "        for line in kgtk_file:\n",
 "            items = line.split(\"\\t\")\n",
 "            qnode = items[0]\n",
 "            label = items[1]\n",
 "            if label == text_embedding_label:\n",
 "                if vector_count == 0:\n",
 "                    vector_length = len(items[2].split(\",\"))\n",
 "                vector_count += 1\n",
 "\n",
 "    # Second pass: write the header, then one line per vector\n",
 "    with open(output_path, \"w\") as w2v_file:\n",
 "        w2v_file.write(\"{} {}\\n\".format(vector_count, vector_length))\n",
 "        with open(input_path, \"r\") as kgtk_file:\n",
 "            next(kgtk_file)\n",
 "            for line in kgtk_file:\n",
 "                items = line.split(\"\\t\")\n",
 "                qnode = items[0]\n",
 "                label = items[1]\n",
 "                if label == text_embedding_label:\n",
 "                    vector = items[2].replace(\",\", \" \")\n",
 "                    w2v_file.write(qnode + \" \" + vector)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "convert_kgtk_to_w2v(os.environ['TE'] + \"/text-embedding.tsv\", os.environ['TE'] + \"/embeddings.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the output file. The embeddings have 1024 dimensions." ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "56017 1024\n", "undirected_pagerank -0.42267796 0.3995441 0.5533569 -0.71286017 0.35639343 0.23904479 -0.2763573 0.37157294 -0.4283453 1.3224101 0.6862846 0.19590487 -0.6082015 -0.11240994 0.33890438 -0.20922732 -0.23069456
-0.021294963 -1.912606 0.49719235 0.6929876 0.011938913 -1.5600294 0.20473605 -0.17875122 0.45237 -0.09061487 0.0838695 0.039139077 -0.5781012 -0.2535121 0.065458305 -0.34608266 -0.42478928 -0.4474916 -0.23409875 -0.13160512 -0.076800026 -0.6984711 0.12516521 -0.42880625 -0.85138726 0.04815936 -0.6207587 -0.08866266 -1.6658425 -0.51067406 -0.34878105 0.33144328 -0.69933593 -0.36479193 -0.6388813 0.76048696 0.12395467 -0.88557744 0.34427696 1.2574033 -0.65131736 -0.9506962 0.6257681 0.36623836 0.716814 0.36953598 -1.3571995 0.2660646 -1.2076085 0.09180403 -0.36115 0.42118248 -0.92440283 -0.32160524 -0.14557533 -0.50016695 -0.12131537 -0.74813855 0.5254087 0.42912796 -0.73770857 -0.39519224 1.1647401 0.63930184 -0.33095387 -0.17238976 0.19148383 -0.31919938 -0.7583614 0.15933603 1.0313777 0.27520698 -0.4556464 -0.63495463 -0.1864288 0.6013224 0.637127 -0.07590211 0.7430643 0.06540778 -0.0065790126 0.44254926 0.27115446 0.37154993 0.022709582 -0.73920345 0.71504974 -0.04737445 0.3215596 0.14265373 0.0013700873 -0.67682695 0.42491677 0.9620013 -0.2962407 0.40307814 0.4662022 0.38908783 -0.6515235 -0.6724364 0.20429769 0.09426039 0.10870178 -0.50047547 -0.16897413 -0.29538417 0.18928146 0.87492365 0.13553919 -0.8622958 0.21274589 -0.683947 0.36040968 -0.3770436 0.03559924 0.11785667 0.0033670748 0.079977475 -0.460622 -0.922562 0.54822904 -0.7001525 -0.13735794 0.0046447627 0.93614495 -0.04533757 1.0877196 0.18663098 -0.33188298 -1.1195552 0.22625268 0.18178236 0.44003317 -0.035616595 -0.17230903 -0.39078838 0.09534323 -0.36450732 -0.13266148 -0.5948716 -0.3778122 0.115013696 -0.48863468 0.5276801 -0.10320456 0.17860238 0.5847855 -0.55870014 1.1700139 -0.8719531 0.2900501 -0.4467073 0.26552573 -0.36334535 0.0765188 -1.2428156 0.07730358 0.08907298 0.52686894 -0.43270507 -1.400375 0.107771374 -0.81395435 -0.24545032 -0.26216444 -0.32014206 -0.35348052 -0.024345992 0.53140754 0.08466306 -0.57038295 -0.1269843 0.58409613 0.46116874 0.94535094 0.025036573 0.057027116 
-0.68037903 1.0046511 -1.2596852 -0.037459765 -0.389251 -0.21985579 0.53391653 0.55650496 0.3328932 1.0321438 -0.16949745 0.61743855 -0.06628016 -0.2838724 -0.72551495 -0.032637402 -0.5673327 -0.0897552 -0.84946555 -0.75218916 0.7547705 0.83145154 -0.26083234 0.14909117 0.11596523 0.15905048 0.7511518 1.3206866 -0.06821178 0.79532903 -0.25254253 0.28651667 0.2536638 1.0395417 0.092335254 1.2873124 0.08776725 -0.05958847 -0.41424736 0.11005009 0.8274726 -0.51250714 -0.09145787 0.27819672 0.8735276 -0.5256038 0.26121446 0.08272835 0.39796406 -0.025718834 0.50356233 0.21068689 0.30204117 -0.30575705 -0.20718881 0.56285316 -0.5681627 0.69479936 0.19411525 0.25880888 0.47330382 -0.3539255 0.31446198 0.05105017 -0.107441604 -0.19249792 -0.39843526 -0.2551087 0.22434184 -0.2916974 -0.43394834 0.9704601 0.05666099 -0.69681704 -0.116564095 -0.03787969 0.27423766 -0.19120161 -0.92002064 0.21582173 -1.139993 0.39552695 -0.43537337 -0.16907048 0.5157604 0.4224562 0.5610382 -0.08036005 -1.4928522 -0.13146974 0.49898425 -1.0245981 0.1626403 -0.38850468 -0.8772544 0.18778482 -0.97421217 -0.29288915 0.6725434 -0.69844306 -0.14755279 -0.1968449 -0.86375725 -0.33360827 -0.10161168 -0.49888122 -0.33912677 0.43528208 -0.42569768 -0.20765932 -0.5381073 1.4305749 1.0162153 0.14457884 0.5763004 0.97068405 0.39098093 -0.03216348 -0.15244858 0.40377033 -0.18645048 0.9399603 0.076710895 0.5312454 -0.26848876 -0.46861956 -0.27942383 -0.6348347 0.294985 -0.40342814 1.0414813 1.0504925 -0.00836426 -0.99118257 -0.45631418 -0.7005619 0.404519 -0.10713773 -0.07559447 0.5544991 0.3827246 -0.55512184 -0.33234987 0.7993359 -0.079852566 0.35297632 0.5477561 0.22683053 0.5069918 -0.13029772 0.36162373 -0.014001881 -0.11648651 -0.66647947 0.01226069 -0.7284193 0.48086953 0.006934624 0.22385629 0.08074516 0.29289985 0.61216664 -0.12032819 0.1659586 -0.2181752 0.15336005 0.4407084 -1.0953207 -0.9043968 0.21611574 -0.90479344 -0.73193157 -0.62168366 0.9956651 -0.090728715 0.3878589 0.38336518 0.2604782 
-0.0650832 0.05577252 0.7666885 0.14315598 0.0359419 0.44156542 -0.15730822 -0.15735826 0.10081276 -0.45704198 0.3992815 1.0245506 1.4449844 0.50542 -0.88196254 0.62593013 -0.2081841 0.60960853 -0.66418105 0.8603846 0.61228853 -1.2286749 -0.20330366 -1.0320998 1.198905 -0.16238491 0.17897743 0.16847304 0.42968208 0.1755085 0.34175223 0.49665308 -0.40418386 0.5926915 -0.6081441 1.0003483 0.3905947 -0.30414084 -0.34114298 0.8547739 -0.4670201 -0.23203468 0.5805412 0.40133566 -0.94826126 -0.23078169 -0.28718835 0.1264745 -0.70524764 0.508715 -0.024303429 0.3079768 0.98509324 0.19859965 -0.2700488 -0.50697654 -0.1804381 -0.3221201 0.22992785 -0.11842905 0.2621886 0.17650005 0.1401335 0.5725611 0.14143167 0.015926411 -0.12371779 -0.61506104 -0.61483264 -0.570195 -0.13236725 -0.11800632 -0.10830958 0.025182672 0.8578056 0.977953 -0.0059525445 -0.39955533 1.127108 -0.4665609 -0.03740844 -0.94570136 -0.1651189 -0.7827557 0.369654 0.20145196 0.50588286 -0.6361171 -0.7590097 -0.21335843 -0.5173786 0.97785115 0.47440884 1.2242765 -1.0599612 0.49780983 1.008144 -0.33477965 0.5589736 0.9486828 -0.07865547 0.82441354 -0.28226215 0.01269538 0.22909257 -1.2406305 0.74198633 0.019226547 -0.033761285 -0.25049102 -0.27017456 0.5518724 -1.0744305 -0.90507793 -0.16111492 -0.5462715 -1.9933928 0.031789362 -1.4327815 0.055561084 0.5697889 -0.5664057 -0.6227874 0.21851781 -0.726629 -1.1050928 0.1555212 -0.13036552 1.5256817 0.0031437278 -0.34641874 -0.26029167 0.2586624 0.21606264 -0.5991851 -0.5353387 -0.013069849 0.12415337 -0.59378207 -0.05707953 1.0167447 -0.41405144 -0.2853063 0.39441592 0.62434036 -0.38296816 0.015720915 1.1869724 0.7920963 0.1103225 -0.19993234 0.867546 0.67698205 1.1679859 1.0601817 -0.32352704 0.22812766 0.99878913 0.14075853 0.22087446 0.38174963 0.63968056 -0.63889086 0.8546627 0.5452647 0.31812298 -0.11800851 -0.6306626 -0.6350914 0.5565482 -0.7874143 0.22039914 -0.5172571 0.3113776 -0.27728507 0.20026723 -1.2695498 0.043180633 0.98999727 -0.24016514 0.7123504 
-0.40300757 -0.7502448 -0.8941951 -0.19905064 1.9472562 0.703084 -0.44553536 -1.5692897 -0.363004 -0.07558155 1.7863687 -0.22492197 -0.25773934 -1.5538926 -0.36908916 0.24482231 -1.495694 0.51339495 0.5043237 -0.4106086 1.9655912 0.34972793 -1.0941802 -0.744956 -1.571301 -0.38214844 0.24033594 0.37885264 0.867155 0.6672241 -0.01693214 1.1466063 -0.5114372 0.72631586 0.38685834 -0.00609982 0.918031 -1.0576688 0.68399566 -0.7276541 -1.6443924 -0.22547406 0.28392553 -0.3197943 0.4078551 -0.7731335 -0.32600537 -0.8067985 -0.23840523 0.3526526 0.0196101 0.25087988 -0.6417036 0.005255079 0.21949208 -0.12147077 0.062054902 -0.5454072 1.025671 0.38807088 0.83292055 0.14208733 0.10787519 -0.05181068 0.27549422 -0.87673503 0.29951787 0.4675076 0.7174594 -0.527458 -1.0612055 -0.73938656 0.10550579 0.28773528 -0.5872211 0.7858924 0.8159002 0.518082 -0.63988984 0.072944984 0.26428187 -0.8011928 0.85742646 -0.6546526 -0.93099636 -0.57665247 0.023779552 1.1399913 -0.06637773 0.40282077 -0.9426894 -0.6185797 -0.09437606 0.5359475 0.022806503 -1.2509018 -0.05353026 -0.18726254 1.3856194 0.25013503 -0.27004337 -0.8613362 -0.6058942 0.21644488 -0.020496178 -0.35646865 -0.06542515 -0.11639291 0.7153526 -0.1760036 0.7813124 0.93504244 -0.23096421 -0.1552721 -0.69693065 0.308117 -0.7010237 -0.28066248 -0.21433288 -0.67217493 0.7867059 0.068477064 -0.57168525 0.012380041 -0.17970753 0.31171468 -0.63663334 -0.023489561 0.22867082 -0.33117527 -0.32161456 -0.18029884 0.4430051 -0.15684946 -0.32500783 0.24891087 -0.37589657 0.1752151 -0.7131431 -0.11198734 -1.0265784 -0.82821333 -0.9937131 -0.04920406 0.2835452 -0.5676211 -0.593093 -0.410075 1.022616 1.6055924 -0.53110176 -0.6283989 -0.049254365 -0.97321147 -0.00038947538 0.519022 -0.894111 0.016800117 -0.5091581 -0.35818344 -0.55171865 -0.42846614 -0.10952275 0.4071202 -0.3670231 0.7691647 0.735392 0.28780562 0.5646238 -0.23212996 -0.32656664 -0.73763084 -0.32413647 -0.6763478 -0.29096603 -0.3797785 0.40527463 0.08826317 -0.26290894 
0.8125853 -0.56574816 -0.5180119 0.33959463 0.27818117 -0.42889327 0.66216576 0.30071586 0.043642543 0.9566169 -0.7295776 -0.6970514 0.06682913 -0.11611781 1.3372544 -0.7711051 -0.27622965 0.07858875 -0.18716207 -0.21521975 0.21165168 -0.14572033 -0.23844214 0.20200655 1.3710401 0.6067855 1.481676 1.573426 0.60474557 0.40126243 0.3611929 -0.4031999 0.56728536 -0.026211482 0.3288062 -0.691287 -0.09511359 2.0640354 -0.35376358 -0.14619523 -1.1336256 -0.4286315 -0.53594714 0.095278636 -0.04674165 -0.5994138 0.7946129 -1.098087 0.3902552 -0.36271507 0.5038213 0.75229025 0.4611937 0.0022006333 0.41274896 0.63416564 -0.83857703 0.32325786 -0.11804989 -0.4368401 0.019128636 0.28285143 0.43789893 -0.13059512 0.7616387 -0.13585262 0.2664371 0.72596914 0.6382323 -0.37144414 0.5277119 0.35573763 0.1688681 1.1595916 0.0906278 -0.4178283 -1.0203297 -0.088457964 -0.2315415 -0.20515415 0.36526158 0.29821527 -0.736996 -0.4478651 1.1028807 -0.89644456 -0.41372925 -1.2328763 -0.4640182 0.5761474 -0.27844954 0.31586835 0.015641235 0.8092839 -0.9372387 0.9934972 -0.1745011 -1.0877256 0.4443961 0.6014369 0.077761345 -0.084602356 0.19059552 -0.35350552 -1.0678735 -1.0453316 0.27547592 0.9063827 0.06397402 0.18907769 -0.8156636 -0.9964015 -0.21612515 0.37872353 0.09812939 0.2623325 0.6650963 -0.5505926 -0.41475606 -0.15147823 0.3543966 -0.32942224 -0.5251814 0.075754255 0.40572733 1.5484574 0.44342157 0.40193808 0.4500907 0.6327993 0.33049652 -0.27509007 -0.40771475 0.59853494 -0.07888409 -0.615096 0.2346958 0.1482316 -0.6923686 -0.5850022 -0.27653936 -0.65077204 -0.36599004 0.5476607 0.026976332 1.5865514 0.5274796 0.6814955 -0.65799254 0.878574 -0.12011457 0.8211617 -0.77377725 1.2645822 0.12579253 -0.27432328 -0.54965115 -1.0023885 -0.79174185 -1.0093133 1.2638149 -1.506173 0.1553447 -0.109686226 -0.86125576 0.58364683 -0.011541486 -0.6712913 0.7731098 1.8426316 -1.6172205 0.34361455 1.1704623 0.79213065 -0.84654135 0.7066611 0.44231334 0.426677 0.6324037 -0.034689866 -0.46297157 
0.5192616 -1.0597641 -0.26437494 0.97451335 -0.22925322 0.19796279 -1.3983903 -0.13544144 -0.47686645 0.11490105 0.14934665 1.0109042 -0.5283708 0.26576567 1.0581495 -1.4754385 3.4011955 0.9815066 0.5532428 0.8212732 -0.15424214 0.24511632 -0.05958248 0.27007103 -0.37704584 -0.9525652 -0.3379977 -0.11389114 0.35535258 0.40745777 -0.91808265 -0.050311718 -0.17617881 -0.64855754 -0.31649047 -0.36157423 -1.325951 -0.39904866 -0.022547163 -0.27759802 -0.02079327 1.000595 1.2254262 0.067013234 -0.4022824 -0.44132692 0.0016483366 -0.46476963 0.50223863 0.11289063 0.47185534 -0.11404542 -1.0319462 -0.5292204 0.569465 0.32238537 -0.33776748 -0.11816757 -0.16153826 -0.428957 -0.572226 0.95514035 -0.929618 0.050701976 -0.2523394 -0.5244093 -0.68299943 -0.018993558 0.14790796 -0.40893796 -1.3917955 0.32277173 -0.09575469 0.5825725 0.7617345 -0.09229246 0.18082699 0.36071482 0.1524885 -0.023177393 0.21314728 0.72768265 -0.019122198 1.0425646 -1.3852326 -0.40800434 0.5618238 -0.8775127 -0.02481873 -0.30698693 0.7367337 0.7375277 -0.5189366 0.30280325 0.9436593 0.83178425 0.38313845 0.665535 0.29839128 -0.9412593 -0.37495625 0.6025321 0.261773 -0.31901595 0.17819108 0.3722279 -0.45606178 0.507588 0.574316 -0.56879294 -0.49606207\n" ] } ], "source": [ "!head -2 \"$TE\"/embeddings.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the text embeddings in gensim" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "te_path = os.environ['TE'] + \"/text-embedding.w2v.txt\"\n", "te_vectors = KeyedVectors.load_word2vec_format(te_path, binary=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare the graph and text embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to Johnnie Walker using the **graph embeddings**" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q48653710.833085298538208similarity'Bartlet for America'@en'episode of The West Wing (S3 E9)'@en
1Q70842790.8258047103881836similarity'Old Ironsides'@en'1926 film by James Cruze'@en
2Q77366020.8078582286834717similarity'The Girl of the Golden West'@en'1930 film by John Francis Dillon'@en
3Q17999480.8060345649719238similarity'Ladies of Leisure'@en'1930 film by Frank Capra'@en
4Q22883280.8006598949432373similarity'The Matinee Idol'@en'1928 film by Walt Disney, Frank Capra'@en
5Q6287370.7132620811462402similarity'Campbeltown Single Malts'@en'single malt Scotch whiskies distilled in the ...
6Q2800.6832661032676697similarity'Lagavulin Distillery'@en'Scotch whisky distillery in Lagavulin, Islay,...
7Q17611850.6419662237167358similarity'Pimm\\\\\\\\'s'@en'alcohol brand'@en
8Q962789790.6371052861213684similarity'Lagavulin 16 years whisky'@en'Lagavulin 16 years single malt scotch whisky'@en
\n", "
" ], "text/plain": [ " node1 node2 label \\\n", "0 Q4865371 0.833085298538208 similarity \n", "1 Q7084279 0.8258047103881836 similarity \n", "2 Q7736602 0.8078582286834717 similarity \n", "3 Q1799948 0.8060345649719238 similarity \n", "4 Q2288328 0.8006598949432373 similarity \n", "5 Q628737 0.7132620811462402 similarity \n", "6 Q280 0.6832661032676697 similarity \n", "7 Q1761185 0.6419662237167358 similarity \n", "8 Q96278979 0.6371052861213684 similarity \n", "\n", " node1;label \\\n", "0 'Bartlet for America'@en \n", "1 'Old Ironsides'@en \n", "2 'The Girl of the Golden West'@en \n", "3 'Ladies of Leisure'@en \n", "4 'The Matinee Idol'@en \n", "5 'Campbeltown Single Malts'@en \n", "6 'Lagavulin Distillery'@en \n", "7 'Pimm\\\\\\\\'s'@en \n", "8 'Lagavulin 16 years whisky'@en \n", "\n", " node1;description \n", "0 'episode of The West Wing (S3 E9)'@en \n", "1 '1926 film by James Cruze'@en \n", "2 '1930 film by John Francis Dillon'@en \n", "3 '1930 film by Frank Capra'@en \n", "4 '1928 film by Walt Disney, Frank Capra'@en \n", "5 'single malt Scotch whiskies distilled in the ... \n", "6 'Scotch whisky distillery in Lagavulin, Islay,... \n", "7 'alcohol brand'@en \n", "8 'Lagavulin 16 years single malt scotch whisky'@en " ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q502268 is Johnnie Walker\n", "kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to Johnnie Walker using the **text embeddings**" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q2800.9379171133041382similarity'Lagavulin Distillery'@en'Scotch whisky distillery in Lagavulin, Islay,...
1Q24900310.9346836805343628similarity'William Grant & Sons'@en'Scottish company which distills Scotch whisky...
2Q15436460.9012988805770874similarity'Rob Roy'@en'cocktail based on Scotch whisky'@en
3Q21685230.8907997012138367similarity'The Famous Grouse'@en'brand of Scotch whisky'@en
4Q10695020.8856703042984009similarity'Chivas Regal'@en'Blended Scotch Whisky produced by Chivas Brot...
5Q48218380.8762272596359253similarity'Aultmore distillery'@en'whisky distillery in Moray, Scotland, UK'@en
6Q47203190.8761684894561768similarity'Alexander Walker'@en'Scottish whisky distiller'@en
7Q17549780.8664095401763916similarity'Rusty Nail'@en'cocktail mixing Drambuie and Scotch whisky'@en
8Q420324780.8583760857582092similarity'Tiree Whisky Company'@en'company that sells whisky on the island of Ti...
9Q200314430.8488548994064331similarity'Something Special'@en'blended Scotch whisky'@en
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q280 0.9379171133041382 similarity 'Lagavulin Distillery'@en \n", "1 Q2490031 0.9346836805343628 similarity 'William Grant & Sons'@en \n", "2 Q1543646 0.9012988805770874 similarity 'Rob Roy'@en \n", "3 Q2168523 0.8907997012138367 similarity 'The Famous Grouse'@en \n", "4 Q1069502 0.8856703042984009 similarity 'Chivas Regal'@en \n", "5 Q4821838 0.8762272596359253 similarity 'Aultmore distillery'@en \n", "6 Q4720319 0.8761684894561768 similarity 'Alexander Walker'@en \n", "7 Q1754978 0.8664095401763916 similarity 'Rusty Nail'@en \n", "8 Q42032478 0.8583760857582092 similarity 'Tiree Whisky Company'@en \n", "9 Q20031443 0.8488548994064331 similarity 'Something Special'@en \n", "\n", " node1;description \n", "0 'Scotch whisky distillery in Lagavulin, Islay,... \n", "1 'Scottish company which distills Scotch whisky... \n", "2 'cocktail based on Scotch whisky'@en \n", "3 'brand of Scotch whisky'@en \n", "4 'Blended Scotch Whisky produced by Chivas Brot... \n", "5 'whisky distillery in Moray, Scotland, UK'@en \n", "6 'Scottish whisky distiller'@en \n", "7 'cocktail mixing Drambuie and Scotch whisky'@en \n", "8 'company that sells whisky on the island of Ti... \n", "9 'blended Scotch whisky'@en " ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q502268 is Johnnie Walker\n", "kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph embeddings produce poor results: the top five matches are films and a television episode, none of them related to whisky. The text embeddings look much better: every match is related to Scotch whisky (brands, distilleries, companies, a distiller, and cocktails)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to Michelob using the **graph embeddings**" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q6106720.9267997741699219similarity'Budweiser'@en'brand of pale lager'@en
1Q487992340.7637178897857666similarity'Virginia Black Whiskey'@en'super-premium brand of American Bourbon whisk...
2Q852699760.762772262096405similarity'Busch Beer'@en'brand of beer owned by Anheuser-Busch'@en
3Q51493890.7565429210662842similarity'Colt 45'@en'malt liquor'@en
4Q30799900.752647340297699similarity'Four Loko'@en'Drink'@en
5Q969523630.7438719272613525similarity'Cronk'@en'American drink'@en
6Q70855330.7436875104904175similarity'Olde English 800'@en'malt liquor'@en
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q610672 0.9267997741699219 similarity 'Budweiser'@en \n", "1 Q48799234 0.7637178897857666 similarity 'Virginia Black Whiskey'@en \n", "2 Q85269976 0.762772262096405 similarity 'Busch Beer'@en \n", "3 Q5149389 0.7565429210662842 similarity 'Colt 45'@en \n", "4 Q3079990 0.752647340297699 similarity 'Four Loko'@en \n", "5 Q96952363 0.7438719272613525 similarity 'Cronk'@en \n", "6 Q7085533 0.7436875104904175 similarity 'Olde English 800'@en \n", "\n", " node1;description \n", "0 'brand of pale lager'@en \n", "1 'super-premium brand of American Bourbon whisk... \n", "2 'brand of beer owned by Anheuser-Busch'@en \n", "3 'malt liquor'@en \n", "4 'Drink'@en \n", "5 'American drink'@en \n", "6 'malt liquor'@en " ] }, "execution_count": 152, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q15874936 is Michelob\n", "kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to Michelob using the **text embeddings**" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q20114730.9664472341537476similarity'Fantôme'@en'brand of beer'@en
1Q33155750.9586231708526611similarity'Bersalis'@en'beer brand'@en
2Q35185540.9563601016998291similarity'Floris'@en'beer brand'@en
3Q150760690.9531255960464478similarity'Marckloff'@en'beer brand'@en
4Q12773880.9511646628379822similarity'Pripps Blå'@en'beer brand'@en
5Q19172550.9475076794624329similarity'St-Idesbald'@en'beer'@en
6Q2639800.9443504810333252similarity'Soproni'@en'beer mark'@en
7Q33377820.9438232779502869similarity'Carrousel'@en'Beer'@en
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q2011473 0.9664472341537476 similarity 'Fantôme'@en \n", "1 Q3315575 0.9586231708526611 similarity 'Bersalis'@en \n", "2 Q3518554 0.9563601016998291 similarity 'Floris'@en \n", "3 Q15076069 0.9531255960464478 similarity 'Marckloff'@en \n", "4 Q1277388 0.9511646628379822 similarity 'Pripps Blå'@en \n", "5 Q1917255 0.9475076794624329 similarity 'St-Idesbald'@en \n", "6 Q263980 0.9443504810333252 similarity 'Soproni'@en \n", "7 Q3337782 0.9438232779502869 similarity 'Carrousel'@en \n", "\n", " node1;description \n", "0 'brand of beer'@en \n", "1 'beer brand'@en \n", "2 'beer brand'@en \n", "3 'beer brand'@en \n", "4 'beer brand'@en \n", "5 'beer'@en \n", "6 'beer mark'@en \n", "7 'Beer'@en " ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q15874936 is Michelob\n", "kgtk_most_similar(te_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph embeddings contain some bad results (for example, a bourbon whiskey brand), but their top matches are better because they include beers more closely related to Michelob, such as Budweiser and Busch Beer. The text embeddings are reasonable because they include only beers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to vodka using the **graph embeddings**" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q205776880.8814862966537476similarity'.vodka'@en'top-level Internet domain'@en
1Q74680320.8503187894821167similarity'Vodka'@en'Detective Conan character'@en
2Q113280650.8384641408920288similarity'Balalaika'@en'Japanese short drink, cocktail'@en
3Q211897250.8248207569122314similarity'Red Eye Louie\\\\\\\\'s Vodquila'@en'blend of vodka and tequila'@en
4Q22065880.8186914920806885similarity'Caipiroska'@en'cocktail prepared with vodka'@en
5Q9204120.8170762062072754similarity'Belvédère'@en'French wine and spirits producer and distribu...
6Q71518010.8166672587394714similarity'Category:Vodkas'@en'Wikimedia category'@en
7Q237127040.8152912855148315similarity'EB-11 / Vodka'@en'encyclopedic article'@en
8Q15395250.8101651668548584similarity'Stolichnaya'@en'vodka brand'@en
\n", "
" ], "text/plain": [ " node1 node2 label \\\n", "0 Q20577688 0.8814862966537476 similarity \n", "1 Q7468032 0.8503187894821167 similarity \n", "2 Q11328065 0.8384641408920288 similarity \n", "3 Q21189725 0.8248207569122314 similarity \n", "4 Q2206588 0.8186914920806885 similarity \n", "5 Q920412 0.8170762062072754 similarity \n", "6 Q7151801 0.8166672587394714 similarity \n", "7 Q23712704 0.8152912855148315 similarity \n", "8 Q1539525 0.8101651668548584 similarity \n", "\n", " node1;label \\\n", "0 '.vodka'@en \n", "1 'Vodka'@en \n", "2 'Balalaika'@en \n", "3 'Red Eye Louie\\\\\\\\'s Vodquila'@en \n", "4 'Caipiroska'@en \n", "5 'Belvédère'@en \n", "6 'Category:Vodkas'@en \n", "7 'EB-11 / Vodka'@en \n", "8 'Stolichnaya'@en \n", "\n", " node1;description \n", "0 'top-level Internet domain'@en \n", "1 'Detective Conan character'@en \n", "2 'Japanese short drink, cocktail'@en \n", "3 'blend of vodka and tequila'@en \n", "4 'cocktail prepared with vodka'@en \n", "5 'French wine and spirits producer and distribu... \n", "6 'Wikimedia category'@en \n", "7 'encyclopedic article'@en \n", "8 'vodka brand'@en " ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q374 is vodka\n", "kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most similar nodes to vodka using the **text embeddings**" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q48692830.9598516225814819similarity'Batini'@en'vodka-based cocktail'@en
1Q35620460.9595369100570679similarity'Vodka Stinger'@en'type of cocktail'@en
2Q22065880.943680465221405similarity'Caipiroska'@en'cocktail prepared with vodka'@en
3Q222362380.9384630918502808similarity'Mariette'@en'vodka, alcohol'@en
4Q79393170.9203515648841858similarity'Vodka Cruiser'@en'brand of vodka-based alcoholic drink'@en
5Q118025650.9155371189117432similarity'Pan Tadeusz'@en'brand of vodka'@en
6Q2680570.9129104614257812similarity'cosmopolitan'@en'cocktail made with vodka'@en
7Q47826170.9107505679130554similarity'Aqua Velva'@en'vodka and gin based cocktail'@en
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q4869283 0.9598516225814819 similarity 'Batini'@en \n", "1 Q3562046 0.9595369100570679 similarity 'Vodka Stinger'@en \n", "2 Q2206588 0.943680465221405 similarity 'Caipiroska'@en \n", "3 Q22236238 0.9384630918502808 similarity 'Mariette'@en \n", "4 Q7939317 0.9203515648841858 similarity 'Vodka Cruiser'@en \n", "5 Q11802565 0.9155371189117432 similarity 'Pan Tadeusz'@en \n", "6 Q268057 0.9129104614257812 similarity 'cosmopolitan'@en \n", "7 Q4782617 0.9107505679130554 similarity 'Aqua Velva'@en \n", "\n", " node1;description \n", "0 'vodka-based cocktail'@en \n", "1 'type of cocktail'@en \n", "2 'cocktail prepared with vodka'@en \n", "3 'vodka, alcohol'@en \n", "4 'brand of vodka-based alcoholic drink'@en \n", "5 'brand of vodka'@en \n", "6 'cocktail made with vodka'@en \n", "7 'vodka and gin based cocktail'@en " ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q374 is vodka\n", "kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph embeddings are noisy: the top matches include nodes that are not vodka beverages, such as a top-level Internet domain and a Detective Conan character. The text embeddings look much better." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 211, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q96760.8613677024841309similarity'Isle of Man'@en'British Crown dependency'@en
1Q12630770.8335838317871094similarity'DAA'@en'company that owns and operates Dublin Airport...
2Q43686230.8250888586044312similarity'Category:Republic of Ireland'@en'Wikimedia category'@en
3Q1644210.8058757781982422similarity'Connacht'@en'province in Ireland'@en
4Q1847600.8017445802688599similarity'County Monaghan'@en'county in Ireland'@en
5Q1782830.7986090183258057similarity'County Limerick'@en'county in Ireland'@en
6Q1862200.7974875569343567similarity'County Longford'@en'county in Ireland'@en
7Q1845940.7974545359611511similarity'County Waterford'@en'county in Ireland'@en
8Q931950.793678879737854similarity'Ulster'@en'province in Ireland'@en
9Q1874020.788328230381012similarity'County Cavan'@en'county in Ireland'@en
\n", "
" ], "text/plain": [ " node1 node2 label \\\n", "0 Q9676 0.8613677024841309 similarity \n", "1 Q1263077 0.8335838317871094 similarity \n", "2 Q4368623 0.8250888586044312 similarity \n", "3 Q164421 0.8058757781982422 similarity \n", "4 Q184760 0.8017445802688599 similarity \n", "5 Q178283 0.7986090183258057 similarity \n", "6 Q186220 0.7974875569343567 similarity \n", "7 Q184594 0.7974545359611511 similarity \n", "8 Q93195 0.793678879737854 similarity \n", "9 Q187402 0.788328230381012 similarity \n", "\n", " node1;label \\\n", "0 'Isle of Man'@en \n", "1 'DAA'@en \n", "2 'Category:Republic of Ireland'@en \n", "3 'Connacht'@en \n", "4 'County Monaghan'@en \n", "5 'County Limerick'@en \n", "6 'County Longford'@en \n", "7 'County Waterford'@en \n", "8 'Ulster'@en \n", "9 'County Cavan'@en \n", "\n", " node1;description \n", "0 'British Crown dependency'@en \n", "1 'company that owns and operates Dublin Airport... \n", "2 'Wikimedia category'@en \n", "3 'province in Ireland'@en \n", "4 'county in Ireland'@en \n", "5 'county in Ireland'@en \n", "6 'county in Ireland'@en \n", "7 'county in Ireland'@en \n", "8 'province in Ireland'@en \n", "9 'county in Ireland'@en " ] }, "execution_count": 211, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q27 Ireland\n", "kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
node1node2labelnode1;labelnode1;description
0Q1910.7959819436073303similarity'Estonia'@en'sovereign state in Northern Europe'@en
1Q370.7896063327789307similarity'Lithuania'@en'sovereign state in Northeastern Europe'@en
2Q340.7771986722946167similarity'Sweden'@en'sovereign state in Northern Europe'@en
3Q350.7717932462692261similarity'Denmark'@en'sovereign state and Scandinavian country in n...
4Q7566170.7578498125076294similarity'Kingdom of Denmark'@en'sovereign unitary state in Europe, the Arctic...
5Q330.7564055919647217similarity'Finland'@en'sovereign state in Northern Europe'@en
6Q169650190.7521861791610718similarity'North borough of Brescia'@en'one of 5 boroughs of Brescia'@en
7Q15265380.7520326972007751similarity'Reykjavík North'@en'one of the six constituencies (kjördæmi) of I...
8Q1890.7486690282821655similarity'Iceland'@en'sovereign state in Northern Europe, situated ...
9Q220.7369431257247925similarity'Scotland'@en'country in Northwest Europe, part of the Unit...
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q191 0.7959819436073303 similarity 'Estonia'@en \n", "1 Q37 0.7896063327789307 similarity 'Lithuania'@en \n", "2 Q34 0.7771986722946167 similarity 'Sweden'@en \n", "3 Q35 0.7717932462692261 similarity 'Denmark'@en \n", "4 Q756617 0.7578498125076294 similarity 'Kingdom of Denmark'@en \n", "5 Q33 0.7564055919647217 similarity 'Finland'@en \n", "6 Q16965019 0.7521861791610718 similarity 'North borough of Brescia'@en \n", "7 Q1526538 0.7520326972007751 similarity 'Reykjavík North'@en \n", "8 Q189 0.7486690282821655 similarity 'Iceland'@en \n", "9 Q22 0.7369431257247925 similarity 'Scotland'@en \n", "\n", " node1;description \n", "0 'sovereign state in Northern Europe'@en \n", "1 'sovereign state in Northeastern Europe'@en \n", "2 'sovereign state in Northern Europe'@en \n", "3 'sovereign state and Scandinavian country in n... \n", "4 'sovereign unitary state in Europe, the Arctic... \n", "5 'sovereign state in Northern Europe'@en \n", "6 'one of 5 boroughs of Brescia'@en \n", "7 'one of the six constituencies (kjördæmi) of I... \n", "8 'sovereign state in Northern Europe, situated ... \n", "9 'country in Northwest Europe, part of the Unit... 
" ] }, "execution_count": 210, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Q27 Ireland\n", "kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the embeddings in queries to the KG" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "# Q281 whiskey\n", "# Q282 wine\n", "# Q3246609 mixed drink\n", "# Q374 vodka\n", "# Q332378 is Absolut" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the most similar nodes to **Absolut**, the Swedish vodka, using the text embeddings, and write them to a file" ] }, { "cell_type": "code", "execution_count": 320, "metadata": {}, "outputs": [], "source": [ "# Q332378 is Absolut\n", "kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q332378.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 321, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q7312560 0.9494208097457886 similarity 'Renat'@en \n", "1 Q406157 0.9068878293037415 similarity 'bäsk'@en \n", "2 Q1034035 0.8990318775177002 similarity 'Finlandia Vodka'@en \n", "3 Q374 0.8908252716064453 similarity 'vodka'@en \n", "4 Q2553569 0.8900324106216431 similarity 'Vodka Martini'@en \n", "5 Q2206588 0.8866583108901978 similarity 'Caipiroska'@en \n", "6 Q268057 0.8860777616500854 similarity 'cosmopolitan'@en \n", "7 Q4021706 0.8785413503646851 similarity 'Xan'@en \n", "8 Q4869283 0.8784171342849731 similarity 'Batini'@en \n", "\n", " node1;description \n", "0 'Swedish vodka'@en \n", "1 'Swedish style spiced liquor'@en \n", "2 'Finnish brand of vodka'@en \n", "3 'distilled alcoholic beverage'@en \n", "4 'cocktail made with vodka and vermouth'@en \n", "5 'cocktail prepared with vodka'@en \n", "6 'cocktail made with vodka'@en \n", "7 'Vodka from Goygol'@en \n", "8 'vodka-based cocktail'@en " ] }, "execution_count": 321, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !head \"$TE\"/Q332378.sim.tsv\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose I have absolut vodka and I want to make a cocktail. I can take the nodes most similar to absolut and search the KG for mixed drinks (`Q3246609`) that appear in that list.\n", "\n", "Here are some drinks we can make with absolut vodka." ] }, { "cell_type": "code", "execution_count": 323, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 node2 node1;label \\\n", "0 Q2553569 0.8900324106216431 'Vodka Martini'@en \n", "1 Q2553569 0.8900324106216431 'Vodka Martini'@en \n", "2 Q2553569 0.8900324106216431 'Vodka Martini'@en \n", "3 Q2553569 0.8900324106216431 'Vodka Martini'@en \n", "4 Q2553569 0.8900324106216431 'Vodka Martini'@en \n", "5 Q2206588 0.8866583108901978 'Caipiroska'@en \n", "6 Q1966883 0.8709859848022461 'Yorsh'@en \n", "7 Q1966883 0.8709859848022461 'Yorsh'@en \n", "8 Q1723060 0.8683922290802002 'Kamikaze'@en \n", "9 Q1723060 0.8683922290802002 'Kamikaze'@en \n", "10 Q1723060 0.8683922290802002 'Kamikaze'@en \n", "11 Q1723060 0.8683922290802002 'Kamikaze'@en \n", "12 Q5580053 0.8639324903488159 'Golden Russian'@en \n", "13 Q5580053 0.8639324903488159 'Golden Russian'@en \n", "14 Q5580053 0.8639324903488159 'Golden Russian'@en \n", "15 Q8032131 0.8580197095870972 'Woo Woo'@en \n", "16 Q8032131 0.8580197095870972 'Woo Woo'@en \n", "17 Q8032131 0.8580197095870972 'Woo Woo'@en \n", "18 Q8032131 0.8580197095870972 'Woo Woo'@en \n", "19 Q8032131 0.8580197095870972 'Woo Woo'@en \n", "\n", " node1;description ingredient \\\n", "0 'cocktail made with vodka and vermouth'@en Q1105343 \n", "1 'cocktail made with vodka and vermouth'@en Q1621080 \n", "2 'cocktail made with vodka and vermouth'@en Q26877166 \n", "3 'cocktail made with vodka and vermouth'@en Q26877423 \n", "4 'cocktail made with vodka and vermouth'@en Q374 \n", "5 'cocktail prepared with vodka'@en Q374 \n", "6 'Russian drink of beer and vodka'@en Q374 \n", "7 'Russian drink of beer and vodka'@en Q44 \n", "8 'cocktail of vodka, triple sec and lime juice'@en Q1105343 \n", "9 'cocktail of vodka, triple sec and lime juice'@en Q3539556 \n", "10 'cocktail of vodka, triple sec and lime juice'@en Q374 \n", "11 'cocktail of vodka, triple sec and lime juice'@en Q5361217 \n", "12 'cocktail of vodka and Galliano'@en Q1331962 \n", "13 'cocktail of vodka and Galliano'@en Q374 \n", "14 'cocktail of vodka and Galliano'@en 
Q5361217 \n", "15 'alcoholic beverage made of vodka, peach schna... Q26877133 \n", "16 'alcoholic beverage made of vodka, peach schna... Q26879660 \n", "17 'alcoholic beverage made of vodka, peach schna... Q374 \n", "18 'alcoholic beverage made of vodka, peach schna... Q4131010 \n", "19 'alcoholic beverage made of vodka, peach schna... Q865448 \n", "\n", " ingredient label \n", "0 'cocktail glass'@en \n", "1 'olive'@en \n", "2 'lemon twist'@en \n", "3 'dry vermouth'@en \n", "4 'vodka'@en \n", "5 'vodka'@en \n", "6 'vodka'@en \n", "7 'beer'@en \n", "8 'cocktail glass'@en \n", "9 'triple sec'@en \n", "10 'vodka'@en \n", "11 'lime juice'@en \n", "12 'Galliano'@en \n", "13 'vodka'@en \n", "14 'lime juice'@en \n", "15 'lime wedge'@en \n", "16 'peach schnapps'@en \n", "17 'vodka'@en \n", "18 'Highball glass'@en \n", "19 'Cranberry juice'@en " ] }, "execution_count": 323, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$TE\"/Q332378.sim.tsv -i \"$Q154CLAIMS\" -i \"$Q154LABEL\" \\\n", "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), \\\n", " claims: (n1)-[:P186]->(:Q374), claims: (n1)-[:P186]->(ingredient), label: (ingredient)-[]->(i_label)' \\\n", "--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description, \\\n", " ingredient as ingredient, i_label as `ingredient label`' \\\n", "--order-by 'cast(similarity, float) desc' \\\n", "--where 'class = \"Q3246609\"' \\\n", "--limit 20 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "code", "execution_count": 291, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 node2 node1;label \\\n", "0 Q1966883 0.7984070181846619 'Yorsh'@en \n", "1 Q2206588 0.7781851291656494 'Caipiroska'@en \n", "2 Q5580053 0.7759937047958374 'Golden Russian'@en \n", "3 Q2553569 0.7755716443061829 'Vodka Martini'@en \n", "4 Q26883085 0.7711346745491028 'Russian Spring Punch'@en \n", "5 Q455914 0.7694578170776367 'Vodka Red Bull'@en \n", "6 Q1723060 0.7578018307685852 'Kamikaze'@en \n", "7 Q621302 0.757564902305603 'Appletini'@en \n", "8 Q8032131 0.7451797723770142 'Woo Woo'@en \n", "9 Q1507096 0.744042158126831 'Moscow mule'@en \n", "\n", " node1;description \n", "0 'Russian drink of beer and vodka'@en \n", "1 'cocktail prepared with vodka'@en \n", "2 'cocktail of vodka and Galliano'@en \n", "3 'cocktail made with vodka and vermouth'@en \n", "4 'sparkling cocktail'@en \n", "5 'alcoholic beverage'@en \n", "6 'cocktail of vodka, triple sec and lime juice'@en \n", "7 'apple-flavored vodka cocktail'@en \n", "8 'alcoholic beverage made of vodka, peach schna... \n", "9 'mule cocktail with vodka, ginger beer and lim... " ] }, "execution_count": 291, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$TE\"/Q332378.sim.tsv -i \"$Q154CLAIMS\" \\\n", "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), claims: (n1)-[:P186]->(:Q374)' \\\n", "--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description' \\\n", "--order-by 'cast(similarity, float) desc' \\\n", "--where 'class = \"Q3246609\"' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are good, lots of choices of cocktails. Note that the embeddings are able to generalize from a specific vodka to vodka in general. The example also illustrates that KGTK can use the results of queries to gensim within queries to the KG." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [], "source": [ "# Q332378 is absolut\n", "kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['GE'] + \"/Q332378.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['node1\\tnode2\\tlabel\\tnode1;label\\tnode1;description', \"Q3527971\\t0.4424980580806732\\tsimilarity\\t'Ti\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\'Punch'@en\\t'cocktail'@en\", \"Q594392\\t0.38892069458961487\\tsimilarity\\t'B-52'@en\\t'cocktail of coffee liqueur, Irish cream and triple sec'@en\", \"Q7535970\\t0.37358343601226807\\tsimilarity\\t'Skittle Bomb'@en\\t'bomb shot cocktail'@en\", \"Q7209010\\t0.37143874168395996\\tsimilarity\\t'Polar Bear'@en\\t'mint chocolate cocktail'@en\", \"Q3309707\\t0.37052232027053833\\tsimilarity\\t'Hawaiian Punch'@en\\t'Fruit punch brand'@en\", \"Q12738893\\t0.3702288269996643\\tsimilarity\\t'Quentão'@en\\t'Brazilian hot drink made \\u200b\\u200bfrom cachaça and some spices'@en\", \"Q2935472\\t0.36788904666900635\\tsimilarity\\t'Campari Soda'@en\\t'pre-mixed drink made by Campari'@en\", \"Q70428\\t0.3663345277309418\\tsimilarity\\t'Karsk'@en\\t'Scandinavian cocktail'@en\", \"Q590793\\t0.3614485263824463\\tsimilarity\\t'Vesper'@en\\t'cocktail originally made of gin, vodka, and Kina Lillet'@en\"]\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q3527971 0.4424980580806732 similarity 'Ti\\\\\\\\\\\\\\\\'Punch'@en \n", "1 Q594392 0.38892069458961487 similarity 'B-52'@en \n", "2 Q7535970 0.37358343601226807 similarity 'Skittle Bomb'@en \n", "3 Q7209010 0.37143874168395996 similarity 'Polar Bear'@en \n", "4 Q3309707 0.37052232027053833 similarity 'Hawaiian Punch'@en \n", "5 Q12738893 0.3702288269996643 similarity 'Quentão'@en \n", "6 Q2935472 0.36788904666900635 similarity 'Campari Soda'@en \n", "7 Q70428 0.3663345277309418 similarity 'Karsk'@en \n", "8 Q590793 0.3614485263824463 similarity 'Vesper'@en \n", "\n", " node1;description \n", "0 'cocktail'@en \n", "1 'cocktail of coffee liqueur, Irish cream and t... \n", "2 'bomb shot cocktail'@en \n", "3 'mint chocolate cocktail'@en \n", "4 'Fruit punch brand'@en \n", "5 'Brazilian hot drink made from cachaça and s... \n", "6 'pre-mixed drink made by Campari'@en \n", "7 'Scandinavian cocktail'@en \n", "8 'cocktail originally made of gin, vodka, and K... " ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q332378.sim.tsv \\\n", "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n", "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n", "--order-by 'cast(similarity, float) desc' \\\n", "--where 'class = \"Q3246609\"' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are poor: for the most part, the retrieved cocktails do not contain vodka. 
Let's try the query with vodka instead of absolut vodka." ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [], "source": [ "# Q374 vodka\n", "kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q374.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " node1 node2 label node1;label \\\n", "0 Q11328065 0.8384641408920288 similarity 'Balalaika'@en \n", "1 Q2206588 0.8186914920806885 similarity 'Caipiroska'@en \n", "2 Q3562046 0.6592038869857788 similarity 'Vodka Stinger'@en \n", "3 Q1966883 0.5952204465866089 similarity 'Yorsh'@en \n", "4 Q5459745 0.5736489295959473 similarity 'flirtini'@en \n", "5 Q455914 0.5721926093101501 similarity 'Vodka Red Bull'@en \n", "6 Q5103598 0.5712590217590332 similarity 'Chocolate Cake'@en \n", "7 Q26879480 0.5568693280220032 similarity 'Godmother'@en \n", "8 Q5580053 0.5458002090454102 similarity 'Golden Russian'@en \n", "9 Q3900577 0.5457539558410645 similarity 'Pertini'@en \n", "\n", " node1;description \n", "0 'Japanese short drink, cocktail'@en \n", "1 'cocktail prepared with vodka'@en \n", "2 'type of cocktail'@en \n", "3 'Russian drink of beer and vodka'@en \n", "4 'cocktail containing vodka, champagne and pine... \n", "5 'alcoholic beverage'@en \n", "6 'cocktail'@en \n", "7 'cocktail'@en \n", "8 'cocktail of vodka and Galliano'@en \n", "9 'cocktail drink with honey'@en " ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q374.sim.tsv \\\n", "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n", "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n", "--order-by 'cast(similarity, float) desc' \\\n", "--where 'class = \"Q3246609\"' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are good. Somehow, the graph embeddings are able to retrieve the cocktails that have vodka, but cannot generalize from absolut vodka to vodka."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Produce files to load in the Google Embedding Projector\n", "We need two files:\n", "\n", "- a TSV file with the vectors\n", "- a TSV file with the metadata, in the same order as the vectors\n", "\n", "We don't want to load all the vectors into the projector because there are too many to visualize. We will load only nodes of the following types:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "focus_types = {\n", "    \"Q3246609\": \"mixed drink\",\n", "    \"Q44\": \"beer\",\n", "    \"Q282\": \"wine\",\n", "    \"Q281\": \"whiskey\",\n", "    \"Q374\": \"vodka\",\n", "    \"Q6256\": \"country\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct a dictionary that maps every q-node in the KG to the set of all its superclasses. We will use this dictionary later to tag each q-node with one of the focus types: for every q-node, we will test whether a focus type is in the set of all its superclasses."
] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "classes_result = !$kypher_raw -i \"$ISA\" -i \"$Q154CLAIMS\" -i \"$TEMP\"/Q154.descendant.tsv -i \"$P279STAR\" \\\n", "--match 'isa: (n1)-[]->(c), P279: (c)-[]->(class), claims: ()-[]->(class), descendant: (n1)-[]->()' \\\n", "--return 'distinct n1 as qnode, class as class' \n", "\n", "class_dict = {}\n", "for r in classes_result[1:]:\n", "    row = r.split(\"\\t\")\n", "    qnode = row[0]\n", "    isa = row[1]\n", "    # Create the set the first time a q-node is seen, then add the superclass\n", "    class_dict.setdefault(qnode, set()).add(isa)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'Q102205',\n", " 'Q1048607',\n", " 'Q11024',\n", " 'Q11028',\n", " 'Q11064354',\n", " 'Q111352',\n", " 'Q11435',\n", " 'Q1150070',\n", " 'Q1166770',\n", " 'Q11795009',\n", " 'Q1190554',\n", " 'Q1194058',\n", " 'Q12055130',\n", " 'Q124291',\n", " 'Q12767945',\n", " 'Q131257',\n", " 'Q13878858',\n", " 'Q1400881',\n", " 'Q1422299',\n", " 'Q14819853',\n", " 'Q14912053',\n", " 'Q154',\n", " 'Q15401930',\n", " 'Q1554231',\n", " 'Q1632297',\n", " 'Q16686448',\n", " 'Q16722960',\n", " 'Q167270',\n", " 'Q1681365',\n", " 'Q16887380',\n", " 'Q16889133',\n", " 'Q169336',\n", " 'Q1704572',\n", " 'Q174984',\n", " 'Q1786828',\n", " 'Q1865992',\n", " 'Q187931',\n", " 'Q1914636',\n", " 'Q20817253',\n", " 'Q20937557',\n", " 'Q2095',\n", " 'Q214609',\n", " 'Q2150504',\n", " 'Q2200417',\n", " 'Q22269697',\n", " 'Q22272508',\n", " 'Q22294683',\n", " 'Q22299433',\n", " 'Q22299483',\n", " 'Q223557',\n", " 'Q23009552',\n", " 'Q23009675',\n", " 'Q2424752',\n", " 'Q25481995',\n", " 'Q266328',\n", " 'Q26717101',\n", " 'Q26907166',\n", " 'Q2695280',\n", " 'Q27166344',\n", " 'Q281',\n", " 'Q2844972',\n", " 'Q28555911',\n", " 'Q28728771',\n", " 'Q28732711',\n", " 'Q28823',\n", " 'Q28877',\n", " 'Q28921572',\n", " 'Q2944660',\n", " 
'Q29651519',\n", " 'Q2990593',\n", " 'Q2996394',\n", " 'Q31464082',\n", " 'Q3249551',\n", " 'Q337060',\n", " 'Q34394',\n", " 'Q3505845',\n", " 'Q35120',\n", " 'Q35758',\n", " 'Q3695082',\n", " 'Q382947',\n", " 'Q386724',\n", " 'Q40050',\n", " 'Q4026292',\n", " 'Q427581',\n", " 'Q42848',\n", " 'Q43460564',\n", " 'Q4406616',\n", " 'Q4437984',\n", " 'Q46737',\n", " 'Q478798',\n", " 'Q483247',\n", " 'Q488383',\n", " 'Q492',\n", " 'Q5127848',\n", " 'Q517596',\n", " 'Q52948',\n", " 'Q5371079',\n", " 'Q54989186',\n", " 'Q551997',\n", " 'Q56139',\n", " 'Q58415929',\n", " 'Q58416391',\n", " 'Q58778',\n", " 'Q6005984',\n", " 'Q6031064',\n", " 'Q64732777',\n", " 'Q6671777',\n", " 'Q7184903',\n", " 'Q781413',\n", " 'Q79529',\n", " 'Q80071',\n", " 'Q813912',\n", " 'Q8171',\n", " 'Q8205328',\n", " 'Q82799',\n", " 'Q837718',\n", " 'Q9081',\n", " 'Q921513',\n", " 'Q9332',\n", " 'Q937228',\n", " 'novalue'}" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class_dict['Q502268']" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "def focus_type(qnode):\n", "    # Superclasses of this q-node (empty set if none were found)\n", "    classes = class_dict.get(qnode, set())\n", "    for t, name in focus_types.items():\n", "        if t in classes:\n", "            return name\n", "    if qnode in country_qnodes:\n", "        return \"country\"\n", "    return \"other\"" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "# Doesn't work because partition didn't work and we don't have the derived.isa file\n", "country_qnodes = set()\n", "!$kypher -i \"$Q154ISA\" \\\n", "--match '(n1)-[]->(:Q6256)'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct `country_qnodes`, the set of all country qnodes." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "country_result = !$kypher_raw -i \"$ISA\" -i \"$P279STAR\" -i \"$Q154CLAIMS\" \\\n", "--match 'claims: (country)-[]->(), isa: (country)-[:isa]->(c), P279: (c)-[]->(:Q6256)' \\\n", 
"--return 'distinct country as country' \n", "\n", "# Skip the header row and collect the country qnodes\n", "country_qnodes = set(country_result[1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct `alcoholic_qnodes`, the set of all alcoholic beverage qnodes." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "alcoholic_qnodes = set()\n", "with open(os.environ[\"TEMP\"] + \"/Q154.descendant.tsv\", \"r\") as descendant_file:\n", "    for line in descendant_file:\n", "        alcoholic_qnodes.add(line.split(\"\\t\")[0])" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "def build_embedding_projector_vectors(embeddings_path):\n", "    input_path = embeddings_path + \"/embeddings.txt\"\n", "    vectors_path = embeddings_path + \"/projector.vectors.tsv\"\n", "    qnodes_path = embeddings_path + \"/projector.qnodes.tsv\"\n", "\n", "    with open(input_path, \"r\") as w2v_file, \\\n", "         open(vectors_path, \"w\") as vectors_file, \\\n", "         open(qnodes_path, \"w\") as qnodes_file:\n", "        qnodes_file.write(\"node1\\n\")\n", "        # Skip the first line of the embeddings file\n", "        next(w2v_file)\n", "        for line in w2v_file:\n", "            items = line.split(\" \")\n", "            qnode = items[0]\n", "            # Keep only alcoholic beverages and countries\n", "            if qnode in alcoholic_qnodes or qnode in country_qnodes:\n", "                # The last item keeps its trailing newline\n", "                vectors_file.write(\"\\t\".join(items[1:]))\n", "                qnodes_file.write(\"{}\\n\".format(qnode))" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "build_embedding_projector_vectors(os.environ[\"GE\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\n", "Q3242283\n", "Q3866024\n", "Q1112057\n", "Q3866020\n", "Q1513599\n", "Q17329207\n", "Q16620320\n", "Q3895013\n", "Q4880027\n" ] } ], "source": [ "!head 
\"$GE\"/translation.projector.qnodes.tsv" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "def build_embedding_projector_metadata(embeddings_path):\n", "    kg_path = os.environ[\"OUT\"] + \"/parts\"\n", "    os.environ[\"_label_graph\"] = kg_path + \"/labels.en.tsv.gz\"\n", "    os.environ[\"_description_graph\"] = kg_path + \"/descriptions.en.tsv.gz\"\n", "    os.environ[\"_qnodes\"] = embeddings_path + \"/projector.qnodes.tsv\"\n", "\n", "    #result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_qnodes\" \\\n", "    #--match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \\\n", "    #--return 'distinct n1 as node1, lab as `node1;label`, des as `node1;description`' \n", "\n", "    result = !$kypher_raw -i \"$_label_graph\" -i \"$_description_graph\" -i \"$_qnodes\" \\\n", "    --match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab)' \\\n", "    --return 'distinct n1 as node1, lab as `node1;label`'\n", "\n", "    # Map each qnode to its label\n", "    qnode_dict = {}\n", "    for line in result[1:]:\n", "        items = line.split(\"\\t\")\n", "        qnode = items[0]\n", "        # qnode_dict[qnode] = \"{} ({})\".format(items[1], items[2])\n", "        qnode_dict[qnode] = \"{}\".format(items[1])\n", "\n", "    metadata_path = embeddings_path + \"/projector.metadata.tsv\"\n", "    with open(os.environ[\"_qnodes\"]) as qnodes_file, open(metadata_path, \"w\") as metadata_file:\n", "        metadata_file.write(\"tag\\tqnode\\ttype\\n\")\n", "        next(qnodes_file)\n", "        for line in qnodes_file:\n", "            qnode = line.rstrip(\"\\n\")\n", "            ftype = focus_type(qnode)\n", "            # Fall back to the qnode id when there is no label\n", "            tag = qnode_dict.get(qnode)\n", "            if tag is None:\n", "                tag = qnode\n", "            tag = \"{} ({})\".format(tag, ftype)\n", "            metadata_file.write(\"{}\\t{}\\t{}\\n\".format(tag, qnode, ftype))" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "build_embedding_projector_metadata(os.environ[\"GE\"])" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Check that the file sizes are consistent: the metadata file has one more line than the vectors file because it has a header row." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2244 14157 116997 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.metadata.tsv\n", " 2243 224300 2805636 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.vectors.tsv\n", " 4487 238457 2922633 total\n" ] } ], "source": [ "!wc \"$GE\"/projector.metadata.tsv \"$GE\"/projector.vectors.tsv" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-0.695853055\t-0.072303891\t0.496231377\t-0.293976039\t0.193507940\t0.096196420\t0.043117594\t-0.580413938\t-0.423150927\t0.348393738\t-0.044707101\t0.447685152\t-0.251975268\t0.192745760\t-0.357472301\t0.204551399\t-0.013355692\t0.216426134\t-0.170541272\t-0.189649135\t-0.299910724\t0.295587122\t0.594068944\t-0.064507566\t0.261834234\t-0.458304882\t-0.426072240\t-0.082138501\t0.007850863\t-0.320901960\t0.727239370\t0.642546177\t-0.339439988\t0.260855168\t0.066383749\t0.018122014\t0.614691317\t-0.109721325\t-0.066969074\t-0.123010576\t0.231307715\t0.633326292\t0.570168674\t-0.550969541\t0.073210679\t-0.459269404\t0.093307532\t0.358197242\t0.623394549\t-0.309046119\t-0.467551976\t0.312151939\t-0.491982907\t0.400699556\t-0.383774340\t-0.446712554\t0.047239214\t0.598234832\t-0.471011013\t-0.039659370\t-0.254376531\t-0.012475031\t-0.207778856\t0.335359454\t0.302034408\t0.153741017\t0.902297437\t-0.261785030\t0.502385259\t-0.139487550\t0.090193652\t-0.114394628\t-0.246014833\t-0.570263982\t0.746979654\t0.009215424\t-0.472881168\t0.205686644\t-0.781571090\t0.133758202\t-0.197057635\t-0.022827761\t-0.097072124\t-0.930668116\t-0.564921737\t-0.811056256\t-0.459467322\t-0.352878183\t-0.494716078\t0.520463228\t0.
076241963\t-0.020195168\t0.423226446\t0.302821845\t-0.207172275\t-0.163210511\t0.028312737\t0.138087898\t0.582748592\t0.285810173\n" ] } ], "source": [ "!head -1 \"$GE\"/projector.vectors.tsv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "build_embedding_projector_vectors(os.environ[\"TE\"])" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "build_embedding_projector_metadata(os.environ[\"TE\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 2782 14542 118309 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.metadata.tsv\n", " 2781 2847744 31710917 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.vectors.tsv\n", " 2782 2782 24800 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.qnodes.tsv\n", " 8345 2865068 31854026 total\n" ] } ], "source": [ "!wc \"$TE\"/projector.metadata.tsv \"$TE\"/projector.vectors.tsv \"$TE\"/projector.qnodes.tsv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "!$kgtk lexicalize -i $OUT/all.tsv.gz \\\n", "--label-properties label \\\n", "--isa-properties P31 P279 P452 P106 \\\n", "--description-properties description \\\n", "--property-value P186 P17 P127 P176 \\\n", "--has-properties \"\" \\\n", "--add-entity-labels-from-input True \\\n", "-o \"$TE\"/sentences.tsv " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [], "source": [ "# Q374 is vodka\n", "kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q374.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [], "source": [ "# Q502268 is Johnnie Walker\n", "kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q502268.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [], "source": [ "# Q332378 is absolut\n", "kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q332378.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [], "source": [ "# Q27 Ireland\n", "kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q27.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 201, "metadata": {}, "outputs": [], "source": [ "# Q29 Spain\n", "kgtk_most_similar(te_vectors, positive=['Q29'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q29.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [], 
"source": [ "# Q29 Spain, Q45 Portugal, Q142 France\n", "kgtk_most_similar(te_vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['TE'] + \"/Q29.Q45.Q142.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [], "source": [ "# Q33 Finland\n", "kgtk_most_similar(te_vectors, positive=['Q33'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['TE'] + \"/Q33.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'vectors' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Q502268 is Johnnie Walker\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mkgtk_most_similar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvectors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpositive\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Q502268'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkg_path\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menviron\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'OUT'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m\"/parts\"\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mtopn\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1000\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_path\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menviron\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'GE'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m\"/Q502268.sim.tsv\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mNameError\u001b[0m: name 'vectors' is not defined" ] } ], "source": [ "# Q502268 is Johnnie Walker\n", "kgtk_most_similar(vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q502268.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [], "source": [ "# Q502268 is Johnnie Walker\n", "kgtk_most_similar(vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q502268.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [], "source": [ "# Q374 is vodka\n", "kgtk_most_similar(vectors, positive=['Q374'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q374.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [], "source": [ "# Q332378 is absolut\n", "kgtk_most_similar(vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q332378.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [], "source": [ "# Q27 Ireland\n", "kgtk_most_similar(vectors, positive=['Q27'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, output_path=os.environ['GE'] + \"/Q27.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [], "source": [ "# Q29 Spain\n", "kgtk_most_similar(vectors, positive=['Q29'], kg_path=os.environ['OUT'] + \"/parts\", topn=1000, 
output_path=os.environ['GE'] + \"/Q29.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [], "source": [ "# Q29 Spain, Q45 Portugal, Q142 France\n", "kgtk_most_similar(vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + \"/parts\", topn=2000, output_path=os.environ['GE'] + \"/Q29.Q45.Q142.sim.tsv\")" ] }, { "cell_type": "code", "execution_count": 211, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.51 real 0.38 user 0.11 sys\n", "node1 node2 label node1;label node1;description\n", "Q3527971 0.4424980580806732 similarity 'Ti\\\\\\\\\\\\\\\\'Punch'@en 'cocktail'@en\n", "Q594392 0.38892069458961487 similarity 'B-52'@en 'cocktail of coffee liqueur, Irish cream and triple sec'@en\n" ] } ], "source": [ "# Q281 whiskey\n", "# Q282 wine\n", "# Q3246609 mixed drink\n", "# Q374 vodka\n", "# Q332378 is absolut\n", "!$kypher -i \"$ISA\" -i \"$P279STAR\" -i \"$GE\"/Q332378.sim.tsv \\\n", "--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \\\n", "--return 'distinct n1 as node1, similarity as node2, \"similarity\" as label, n1.label, n1.description' \\\n", "--order-by 'cast(similarity, float) desc' \\\n", "--where 'class = \"Q3246609\"' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "lines = !kgtk remove-columns -i \"$Q154LABEL\" --all-except --columns node1 node2 \n", "label_dict = {}\n", "for line in lines[1:]:\n", " items = line.split(\"\\t\")\n", " label_dict[items[0]] = items[1]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "lines = !kgtk remove-columns -i \"$Q154DESCRIPTION\" --all-except --columns node1 node2 \n", 
"description_dict = {}\n", "for line in lines[1:]:\n", " items = line.split(\"\\t\")\n", " description_dict[items[0]] = items[1]" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "def show_labels(similar_list):\n", " result = []\n", " for x in similar_list:\n", " text = \"{}, {} ({}), {}\".format(label_dict.get(x[0]), description_dict.get(x[0]), x[0], x[1])\n", " result.append((text))\n", " return result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Stop here: the stuff below is Pedro's scratchpad, will be deleted later" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleanup\n", "\n", "Remove `novalue` and `somevalue`" ] } ], "metadata": { "kernelspec": { "display_name": "kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }