# KGTK Tutorial

Beer sites:
- https://www.realbeer.com/edu/health/calories.php
- http://getdrunknotfat.com/alcohol-content-of-beer/

In [4]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

In [5]:
kgtk_environment_variables

['ALIAS',
 'ALL',
 'CLAIMS',
 'DESCRIPTION',
 'EXAMPLES_DIR',
 'GE',
 'ISA',
 'ITEM',
 'LABEL',
 'OUT',
 'P279',
 'P279STAR',
 'PROPERTY_DATATYPES',
 'Q154ALIAS',
 'Q154ALL',
 'Q154CLAIMS',
 'Q154DESCRIPTION',
 'Q154ISA',
 'Q154ITEM',
 'Q154LABEL',
 'Q154P279',
 'Q154P279STAR',
 'Q154PROPERTY_DATATYPES',
 'Q154QUALIFIERS',
 'Q154QUALIFIERS_TIME',
 'Q154SITELINKS',
 'QUALIFIERS',
 'QUALIFIERS_TIME',
 'SITELINKS',
 'STORE',
 'TE',
 'TEMP',
 'WIKIDATA',
 'kgtk',
 'kypher']

In [6]:
%cd {output_path}

/Users/pedroszekely/Downloads/kypher


In [7]:
!mkdir {output_folder}
!mkdir {temp_folder}

mkdir: wikidata_os_v5: File exists
mkdir: temp.wikidata_os_v5: File exists


In [8]:
!mkdir "$GE"

mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding: File exists


In [9]:
!mkdir "$TE"

mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding: File exists


# Wikidata in KGTK
KGTK can import a Wikidata JSON dump and convert it to the KGTK representation, making it easy to process the full Wikidata KG on a laptop. There are 86 files, which include all the information available in the Wikidata dump as well as files containing commonly used information derived from the dump. We partitioned the files because most use cases need only a subset of them.

The files are very large. `claims.tsv` (23GB compressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions.

In [11]:
!ls -lh "$CLAIMS" "$QUALIFIERS" "$LABEL" "$ALIAS" "$DESCRIPTION"

-rw-r--r--  1 pedroszekely  staff    68M Nov 16 08:07 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/aliases.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   4.7G Nov 16 08:05 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz
-rw-r--r--  1 pedroszekely  staff   269M Nov 16 08:08 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/descriptions.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   376M Nov 16 08:06 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/labels.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   662M Nov 16 08:43 /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/qualifiers.tsv.gz


`claims.tsv` contains many edges:

In [7]:
!time zcat < "$CLAIMS" | wc

 254135077 1578463882 20285305033

real	1m15.857s
user	2m7.309s
sys	0m8.130s


# KGTK Data Model
The KGTK data model is a generalization of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. The order of the columns is arbitrary.

All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store additional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.
Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` records the Wikidata type of the value in the `node2` column.
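
As a rough illustration of the data model, the following Python sketch reads a small KGTK-style table with the standard library and checks that the four required columns are present. The two sample rows are taken from `claims.tsv`, with the columns deliberately reordered; `read_edges` is a hypothetical helper name and not part of KGTK.

```python
import csv
import io

REQUIRED = {"id", "node1", "label", "node2"}

# Two edges from claims.tsv. The columns are reordered on purpose to show
# that the column order is arbitrary; only the header names matter.
SAMPLE = (
    "node1\tlabel\tnode2\tid\trank\n"
    "P10\tP1629\tQ34508\tP10-P1629-Q34508-bcc39400-0\tnormal\n"
    "P10\tP1659\tP18\tP10-P1659-P18-5e4b9c4f-0\tnormal\n"
)


def read_edges(text):
    """Parse KGTK-style TSV text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


edges = read_edges(SAMPLE)
for edge in edges:
    print(edge["id"], edge["node1"], edge["label"], edge["node2"])
```

Looking columns up by name rather than by position is what makes the arbitrary column order harmless.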

## Claims

In [8]:
!zcat < "$CLAIMS" | head | column -t -s $'\t'

id                              node1  label  node2                                    rank    node2;wikidatatype
P10-P1628-32b85d-7927ece6-0     P10    P1628  "http://www.w3.org/2006/vcard/ns#Video"  normal  url
P10-P1628-acf60d-b8950832-0     P10    P1628  "https://schema.org/video"               normal  url
P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508                                   normal  wikibase-item
P10-P1659-P1651-c4068028-0      P10    P1659  P1651                                    normal  wikibase-property
P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18                                      normal  wikibase-property
P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238                                    normal  wikibase-property
P10-P1659-P51-86aca4c5-0        P10    P1659  P51                                      normal  wikibase-property
P10-P1855-Q15075950-7eff6d65-0  P10    P1855  Q15075950                                normal  wikibase-item
P10-P1855-Q6906365

Wikidata uses numbers to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements. The format of the id can be arbitrary, but we assigned ids so that sorting a file by id places all edges about a subject consecutively.

In [9]:
!wd u P10 P1628 P1629

/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6
module.exports = async params => {
                       ^^^^^^

SyntaxError: Unexpected identifier
    at createScript (vm.js:56:10)
    at Object.runInThisContext (vm.js:97:10)
    at Module._compile (module.js:549:28)
    at Object.Module._extensions..js (module.js:586:10)
    at Module.load (module.js:494:32)
    at tryModuleLoad (module.js:453:12)
    at Function.Module._load (module.js:445:3)
    at Module.require (module.js:504:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)


Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use the KGTK query command to fetch edges about Belgium. `$kypher` is a shortcut for the `kgtk query` command that also passes in the location of the SQLite database we are using to store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with label `P1081` (human development index).

In [262]:
result = !$kypher_raw -i "$CLAIMS" \
--match '(:Q31)-[]-()' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2,rank,node2;wikidatatype
0,Q31-P1036-c4e1ad-df86eeb8-0,Q31,P1036,"""2--493""",normal,external-id
1,Q31-P1081-02c2ed-033524b0-0,Q31,P1081,+0.866,normal,quantity
2,Q31-P1081-02c2ed-7971505b-0,Q31,P1081,+0.866,normal,quantity
3,Q31-P1081-068470-c1c63b8d-0,Q31,P1081,+0.889,normal,quantity
4,Q31-P1081-068470-ddac01e0-0,Q31,P1081,+0.889,normal,quantity
5,Q31-P1081-144738-c1851cdc-0,Q31,P1081,+0.905,normal,quantity
6,Q31-P1081-175742-c07ac1c8-0,Q31,P1081,+0.888,normal,quantity
7,Q31-P1081-19636d-c08dd8a8-0,Q31,P1081,+0.896,normal,quantity
8,Q31-P1081-1efc03-433a7a4d-0,Q31,P1081,+0.913,normal,quantity
9,Q31-P1081-1f8602-ddac530d-0,Q31,P1081,+0.852,normal,quantity


The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file.

In [11]:
!zcat < "$LABEL" | head | column -t -s $'\t'

id              node1  label  node2
P10-label-en    P10    label  'video'@en
P1000-label-en  P1000  label  'record held'@en
P1001-label-en  P1001  label  'applies to jurisdiction'@en
P1002-label-en  P1002  label  'engine configuration'@en
P1003-label-en  P1003  label  'National Library of Romania ID'@en
P1004-label-en  P1004  label  'MusicBrainz place ID'@en
P1005-label-en  P1005  label  'Portuguese National Library ID'@en
P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en
P1007-label-en  P1007  label  'Lattes Platform number'@en
zcat: error writing to output: Broken pipe


KGTK accepts multiple files as input and can do a join to retrieve the label for each property. When using multiple files, it is necessary to tag each clause with the file that provides the data for the clause. For example, the first clause is tagged with `claim` because the word `claim` is part of the file name. The variable `property` is used to connect the two clauses.

In [12]:
!$kypher -i "$CLAIMS" -i "$LABEL" \
--match 'claim: (n1:Q31)-[l {label: property}]-(n2), label: (property)-[:label]->(property_label)' \
--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 \
| column -t -s $'\t'

        0.90 real         0.77 user         0.11 sys
id                           node1  label  node2     label;label
Q31-P1036-c4e1ad-df86eeb8-0  Q31    P1036  "2--493"  'Dewey Decimal Classification'@en
Q31-P1081-02c2ed-033524b0-0  Q31    P1081  +0.866    'Human Development Index'@en
Q31-P1081-02c2ed-7971505b-0  Q31    P1081  +0.866    'Human Development Index'@en
Q31-P1081-068470-c1c63b8d-0  Q31    P1081  +0.889    'Human Development Index'@en
Q31-P1081-068470-ddac01e0-0  Q31    P1081  +0.889    'Human Development Index'@en
Q31-P1081-144738-c1851cdc-0  Q31    P1081  +0.905    'Human Development Index'@en
Q31-P1081-175742-c07ac1c8-0  Q31    P1081  +0.888    'Human Development Index'@en
Q31-P1081-19636d-c08dd8a8-0  Q31    P1081  +0.896    'Human Development Index'@en
Q31-P1081-1efc03-433a7a4d-0  Q31    P1081  +0.913    'Human Development Index'@en
Q31-P1081-1f8602-ddac530d-0  Q31    P1081  +0.852    'Human Development Index'@en
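
The join in the query above can be pictured in plain Python. This is only a sketch of the idea, not how KGTK executes the query (KGTK runs it against its SQLite store): the toy rows are copied from the outputs shown in this section, and `label_of` and `joined` are illustrative names.

```python
# Toy copies of rows shown above.
# Claim edges: (id, node1, label, node2).
claims = [
    ("Q31-P1036-c4e1ad-df86eeb8-0", "Q31", "P1036", '"2--493"'),
    ("Q31-P1081-02c2ed-033524b0-0", "Q31", "P1081", "+0.866"),
]
# Label edges: (node1, label, node2).
labels = [
    ("P1036", "label", "'Dewey Decimal Classification'@en"),
    ("P1081", "label", "'Human Development Index'@en"),
]

# Index the label edges by node1 so each claim can look up its property label.
label_of = {node1: node2 for node1, _, node2 in labels}

# The shared variable `property` in the query plays the role of `prop` here:
# it must match the claim's label and the label edge's node1 at the same time.
joined = [
    (edge_id, node1, prop, node2, label_of[prop])
    for edge_id, node1, prop, node2 in claims
    if node1 == "Q31" and prop in label_of  # inner join: skip unlabeled properties
]

for row in joined:
    print("\t".join(row))
```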


Let's look at the heads of state of Belgium, recorded in property `P35`:

In [13]:
!$kypher -i "$CLAIMS" -i "$LABEL" \
--match 'claims: (n1:Q31)-[l:P35]->(n2), labels: (n2)-[:label]->(n2_label)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \
--limit 10 \
| column -t -s $'\t'

        0.86 real         0.74 user         0.10 sys
id                            node1  label  node2      node2;label
Q31-P35-Q1079522-c82ed584-0   Q31    P35    Q1079522   'Erasme Louis Surlet de Chokier'@en
Q31-P35-Q12967-f2b9aaf3-0     Q31    P35    Q12967     'Leopold II of Belgium'@en
Q31-P35-Q12971-2088471b-0     Q31    P35    Q12971     'Leopold I of Belgium'@en
Q31-P35-Q12973-31c1b700-0     Q31    P35    Q12973     'Leopold III of Belgium'@en
Q31-P35-Q12976-f3e8a567-0     Q31    P35    Q12976     'Baudouin I of Belgium'@en
Q31-P35-Q155004-619ba603-0    Q31    P35    Q155004    'Philippe I of Belgium'@en
Q31-P35-Q3911-137f01fe-0      Q31    P35    Q3911      'Albert II of Belgium'@en
Q31-P35-Q445553-7599749f-0    Q31    P35    Q445553    'Prince Charles, Count of Flanders'@en
Q31-P35-Q55008046-725dce40-0  Q31    P35    Q55008046  'Albert I of Belgium'@en


## Qualifiers
Qualifiers provide additional information about the claims stated in the edges. For `P1081` the qualifiers tell us the year, and for head of state the qualifiers provide information about the period of time and position held by the head of state. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first head of state (Erasme Louis). To do so, we use the identifier of the edge (`Q31-P35-Q1079522-c82ed584-0`) as `node1` in the `qualifiers.tsv` file. We get three edges, meaning that the edge `Q31/P35/Q1079522` has three qualifiers. Note that the qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:

In [14]:
!$kypher -i "$QUALIFIERS" \
--match '(n1:`Q31-P35-Q1079522-c82ed584-0`)-[l]->(n2)' \
--limit 10 \
| column -t -s $'\t'

        0.90 real         0.77 user         0.11 sys
id                                         node1                        label  node2                     node2;wikidatatype
Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   wikibase-item
Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  time
Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  time
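
To make the "edges about edges" idea concrete, here is a small Python sketch that looks up qualifiers by using a claim's `id` as the `node1` of the qualifier edges. The rows are copied from outputs in this tutorial; `qualifiers_of` is a hypothetical helper, not a KGTK function.

```python
# A claim edge: (id, node1, label, node2).
claim = ("Q31-P35-Q1079522-c82ed584-0", "Q31", "P35", "Q1079522")

# Qualifier edges have the same four columns; their node1 is the id of a claim.
qualifiers = [
    ("Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0",
     "Q31-P35-Q1079522-c82ed584-0", "P39", "Q477406"),
    ("Q31-P35-Q1079522-c82ed584-0-P580-106076-0",
     "Q31-P35-Q1079522-c82ed584-0", "P580", "^1831-02-25T00:00:00Z/11"),
    ("Q31-P35-Q1079522-c82ed584-0-P582-774519-0",
     "Q31-P35-Q1079522-c82ed584-0", "P582", "^1831-07-20T00:00:00Z/11"),
    # A qualifier of a different claim, which must not be returned.
    ("Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0",
     "Q31-P35-Q12967-f2b9aaf3-0", "P39", "Q13592862"),
]


def qualifiers_of(edge_id, qualifier_edges):
    """Return (label, node2) pairs for qualifier edges whose node1 is edge_id."""
    return [(label, node2)
            for _, node1, label, node2 in qualifier_edges
            if node1 == edge_id]


found = qualifiers_of(claim[0], qualifiers)
for label, value in found:
    print(label, value)
```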


Let's make them readable: the following query combines the patterns of the previous two queries to retrieve the label of each qualifier property. The query omits the identifier of the qualifier edges to save space. Also, the headers of the additional columns can be arbitrary, i.e., you can name them whatever you want; the names used here follow a KGTK convention that enables KGTK to automatically parse the output, which is useful if we want to use the output as an input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label` because the column contains the label of the entity.

In [15]:
!$kypher -i "$QUALIFIERS" -i "$LABEL" \
--match 'qual: (n1:`Q31-P35-Q1079522-c82ed584-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \
--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 \
| column -t -s $'\t'

        0.90 real         0.77 user         0.11 sys
node1                        label  node2                     label;label
Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   'position held'@en
Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  'start time'@en
Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  'end time'@en
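
The column naming convention can be sketched in a few lines of Python. This mirrors the convention only as it is described in this tutorial and is not KGTK's own parser; `split_column` is a hypothetical name.

```python
def split_column(name):
    """Split a header such as 'node2;label' into (standard column, property).

    Headers without ';' are plain columns, so the property part is None.
    """
    base, sep, prop = name.partition(";")
    return (base, prop if sep else None)


for header in ["node1", "label;label", "node2;wikidatatype"]:
    print(header, "->", split_column(header))
```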


Let's put all the values of `P35` in a file, which we will conveniently name `Q31.P35.tsv`

In [16]:
!$kypher -i "$CLAIMS" \
--match '(n1:Q31)-[l:P35]->(n2)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q31.P35.tsv

        0.83 real         0.71 user         0.09 sys


Now we are going to combine the `P35` edges of Belgium with the qualifiers. To do this we will run a query that uses the edges that we stored in `Q31.P35.tsv`, and retrieve the qualifiers for each of those edges; the result of our query will be the qualifier edges of the head of state edges. To union the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the output to the `sort2` command to sort the edges. The first 12 edges are shown below. We see a claim edge followed by the qualifiers defined for it.

This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows.

In [17]:
!$kypher -i "$QUALIFIERS" -i "$TEMP"/Q31.P35.tsv \
--match 'P35: ()-[l]->(), qual: (l)-[lq]->(n2)' \
--return 'lq as id, l as node1, lq.label as label, n2 as node2' \
/ cat -i - -i "$TEMP"/Q31.P35.tsv \
/ sort2 \
| head -12 \
| column -t -s $'\t'

id                                         node1                        label  node2
Q31-P35-Q1079522-c82ed584-0                Q31                          P35    Q1079522
Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406
Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11
Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11
Q31-P35-Q12967-f2b9aaf3-0                  Q31                          P35    Q12967
Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0  Q31-P35-Q12967-f2b9aaf3-0    P39    Q13592862
Q31-P35-Q12967-f2b9aaf3-0-P580-f29037-0    Q31-P35-Q12967-f2b9aaf3-0    P580   ^1865-12-17T00:00:00Z/11
Q31-P35-Q12967-f2b9aaf3-0-P582-136f02-0    Q31-P35-Q12967-f2b9aaf3-0    P582   ^1909-12-17T00:00:00Z/11
Q31-P35-Q12971-2088471b-0                  Q31                          P35    Q12971
Q31-P35-Q12971-2088471b-0-P39-Q13592862-0  Q31-P35-Q12971-20884
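
The reason each claim edge ends up directly before its qualifiers is that a qualifier id starts with the id of the claim it qualifies, so an ordering by `id` keeps them together. The Python sketch below reproduces the select, concatenate and sort steps on a few rows copied from this tutorial. It assumes the sort is by `id`, as the output above suggests, and it is not how the KGTK commands are implemented.

```python
# Claim edges, deliberately out of order: (id, node1, label, node2).
claims = [
    ("Q31-P35-Q12967-f2b9aaf3-0", "Q31", "P35", "Q12967"),
    ("Q31-P35-Q1079522-c82ed584-0", "Q31", "P35", "Q1079522"),
]
# Qualifier edges, also out of order; node1 is the id of the qualified claim.
qualifiers = [
    ("Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0",
     "Q31-P35-Q12967-f2b9aaf3-0", "P39", "Q13592862"),
    ("Q31-P35-Q1079522-c82ed584-0-P580-106076-0",
     "Q31-P35-Q1079522-c82ed584-0", "P580", "^1831-02-25T00:00:00Z/11"),
    ("Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0",
     "Q31-P35-Q1079522-c82ed584-0", "P39", "Q477406"),
]

# Step 1 (the query): keep qualifiers whose node1 is the id of one of our claims.
claim_ids = {edge[0] for edge in claims}
selected = [q for q in qualifiers if q[1] in claim_ids]

# Steps 2 and 3 (cat, then sort): concatenate and order by id.
combined = sorted(claims + selected, key=lambda edge: edge[0])

for edge in combined:
    print("\t".join(edge))
```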

## Summary

- KGTK represents graphs in TSV files with standard columns `id`, `node1`, `label` and `node2`
- It is possible to include arbitrary additional columns in KGTK files
- The identifier of an edge can be used as a node in another edge enabling the representation of edges about edges
- KGTK provides a powerful query command based on Cypher, as well as a host of other commands; type `kgtk --help` to see the list of commands.

# Use Case: A Knowledge Graph About Alcoholic Beverages
We are going to build a small KG about alcoholic beverages by extracting from Wikidata the subgraph that relates to alcoholic beverages (https://www.wikidata.org/wiki/Q154).

### Step 1: create a list of all descendants of `alcoholic beverage` (https://www.wikidata.org/wiki/Q154)

In [18]:
!wd u Q154

/usr/local/lib/node_modules/wikibase-cli/lib/entity_data_parser.js:6
module.exports = async params => {
                       ^^^^^^

SyntaxError: Unexpected identifier
    at createScript (vm.js:56:10)
    at Object.runInThisContext (vm.js:97:10)
    at Module._compile (module.js:549:28)
    at Object.Module._extensions..js (module.js:586:10)
    at Module.load (module.js:494:32)
    at tryModuleLoad (module.js:453:12)
    at Function.Module._load (module.js:445:3)
    at Module.require (module.js:504:17)
    at require (internal/module.js:20:19)
    at Object.<anonymous> (/usr/local/lib/node_modules/wikibase-cli/bin/wb-summary:2:26)


Wikidata uses two properties to organize entities in a hierarchy: the `instance of` property (`P31`) and the `subclass of` (`P279`) property. In many cases, the distinction between instance of and subclass of is subtle, and we find many situations in Wikidata where either one or the other is used to organize hierarchies. For this reason, we created a new property called `isa` that contains the union of `P31` and `P279`, and stored it in the file `derived.isa.tsv`.

In [19]:
!zcat < "$ISA" | head -5

node1	label	node2
P10	isa	Q18610173
P1000	isa	Q18608871
P1001	isa	Q15720608
P1001	isa	Q22984026
zcat: error writing to output: Broken pipe


To get all the alcoholic beverages, we need to get all entities that are `isa` of alcoholic beverage (`Q154`) or that are `isa` of any descendant of `Q154` in the `subclass of` (`P279`) hierarchy. The length of the chain of `P279` edges can be arbitrarily long. To support this use case, KGTK offers the `derived.P279star.tsv` file that contains edges `n1/P279star/n2` if `n1` is a descendant of `n2` on chains of `P279` edges, including chains of zero length (`n1/P279star/n1`).

In [20]:
!zcat < "$P279STAR" | head -5 | column -t -s $'\t'

node1     label     node2     id
Q1000032  P279star  Q1000032  Q1000032-P279star-Q1000032-0000
Q1000032  P279star  Q1150070  Q1000032-P279star-Q1150070-0000
Q1000032  P279star  Q1190554  Q1000032-P279star-Q1190554-0000
Q1000032  P279star  Q133500   Q1000032-P279star-Q133500-0000
zcat: error writing to output: Broken pipe
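
What a `P279star` file contains can be sketched as a reflexive transitive closure over `P279` edges. The Python example below is a minimal illustration on a toy chain and does not show how KGTK derives the file. The toy edges loosely follow the chain from alcoholic beverage to `fluid` mentioned later in this tutorial and are assumptions for illustration, not checked against Wikidata; `p279_star` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy subclass-of edges as (child, parent) pairs. Assumed for illustration.
p279 = [
    ("Q154", "Q40050"),     # alcoholic beverage -> drink
    ("Q40050", "Q11435"),   # drink -> liquid
    ("Q11435", "Q102205"),  # liquid -> fluid
]


def p279_star(edges):
    """Return all (descendant, ancestor) pairs, including zero-length chains."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    closure = set()
    for start in nodes:
        closure.add((start, start))  # chain of zero length: n1/P279star/n1
        seen = {start}
        stack = [start]
        while stack:  # depth-first walk up the hierarchy
            node = stack.pop()
            for parent in parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    closure.add((start, parent))
                    stack.append(parent)
    return closure


star = p279_star(p279)
for n1, n2 in sorted(star):
    print(n1, "P279star", n2)
```

The `seen` set keeps the walk from looping forever if the hierarchy contains a cycle.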


To get all alcoholic beverages, we need to find all nodes `n1` that are connected to `Q154` with an `isa` edge and a chain of `P279` edges:

In [21]:
!$kypher -i "$ISA" -i "$P279STAR" -i "$LABEL" \
--match 'isa: (n1)-[]->(n2), star: (n2)-[]->(n3:Q154), label: (n1)-[]->(n1l)' \
--return 'n1 as node1, n1l as `node1;label`, n3 as node2, "isastar" as label' \
-o "$TEMP"/Q154.descendant.tsv

        3.18 real         0.93 user         0.57 sys
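
The pattern in this query, one `isa` edge followed by one `P279star` edge ending at `Q154`, can be sketched in Python as follows. The toy data is partly copied from outputs in this tutorial (the `isa` pairs for `Q1350656` and `Q10429117` and their labels); the `P279star` edges and the `Q31` rows are assumptions added for illustration. KGTK itself evaluates the query in SQLite, not like this.

```python
# isa edges as (node1, node2) pairs.
isa = [
    ("Q1350656", "Q1007164"),
    ("Q10429117", "Q10210"),
    ("Q31", "Q6256"),          # assumed: an entity unrelated to beverages
]
# P279star edges as (descendant, ancestor) pairs. Assumed for illustration.
star = {
    ("Q1007164", "Q154"),
    ("Q10210", "Q154"),
    ("Q6256", "Q6256"),
}
labels = {
    "Q1350656": "'Corn whiskey'@en",
    "Q10429117": "'Beyaz'@en",
    "Q31": "'Belgium'@en",
}

TARGET = "Q154"

# n1 -isa-> n2 -P279star-> Q154, joined with the label of n1.
descendants = [
    (n1, labels[n1], TARGET, "isastar")
    for n1, n2 in isa
    if (n2, TARGET) in star and n1 in labels
]

for row in descendants:
    print("\t".join(row))
```

Because `P279star` includes zero-length chains, an entity that is `isa` of `Q154` itself is also matched by the same pattern.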


Here is a sample of alcoholic beverages in Wikidata

In [22]:
!head "$TEMP"/Q154.descendant.tsv | column -t -s $'\t'

node1      node1;label                  node2  label
Q1350656   'Corn whiskey'@en            Q154   isastar
Q20713240  'Buckwheat whisky'@en        Q154   isastar
Q2535077   'Rye Whiskey'@en             Q154   isastar
Q536976    'Canadian whisky'@en         Q154   isastar
Q7991845   'Wheat whiskey'@en           Q154   isastar
Q10429117  'Beyaz'@en                   Q154   isastar
Q1069954   'Prosecco'@en                Q154   isastar
Q1094850   'Clairette du Languedoc'@en  Q154   isastar
Q1135592   'Cortese di Gavi'@en         Q154   isastar


And the total number:

In [23]:
!wc "$TEMP"/Q154.descendant.tsv

    3251   16116  133341 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.descendant.tsv


The computation of `Q154.descendant.tsv` can be implemented in SPARQL using the common `P31/P279*` graph pattern, but the query will time out if the result size is large. For example, the query will time out when requesting all descendants of chemical compounds, as there are over one million chemical compounds in Wikidata. The query can be easily done in KGTK.

### Step 2: get the incoming and outgoing edges
We want our graph to have the neighbors of all alcoholic beverages, so we need to get the incoming and outgoing edges.

The following query gets the outgoing edges.

In [24]:
!$kypher -i "$CLAIMS" -i "$TEMP"/Q154.descendant.tsv \
--match 'Q154: (n1)-[]->(), claims: (n1)-[l]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.node1.tsv.gz

        2.34 real         1.03 user         0.36 sys


We see that we are getting several properties for our items:

In [25]:
!zcat < "$TEMP"/Q154.node1.tsv.gz | head | column -t -s $'\t'

id                                   node1     label  node2
Q1000737-P1435-Q17297633-53903946-0  Q1000737  P1435  Q17297633
Q1000737-P1454-Q460178-8ad4931b-0    Q1000737  P1454  Q460178
Q1000737-P159-Q16003-31e24011-0      Q1000737  P159   Q16003
Q1000737-P17-Q183-24107fe2-0         Q1000737  P17    Q183
Q1000737-P18-147fc9-667304f8-0       Q1000737  P18    "Marthabräuhalle 2011-04-03.jpg"
Q1000737-P31-Q131734-f97bd6f6-0      Q1000737  P31    Q131734
Q1000737-P31-Q15075508-a4c83928-0    Q1000737  P31    Q15075508
Q1000737-P373-689157-3110aade-0      Q1000737  P373   "Marthabräu"
Q1000737-P452-Q869095-f5d8e7a2-0     Q1000737  P452   Q869095
zcat: error writing to output: Broken pipe


Now get the incoming edges:

In [26]:
!$kypher -i "$CLAIMS" -i "$TEMP"/Q154.descendant.tsv \
--match 'Q154: (n1)-[]->(), claims: (n3)-[l]->(n1)' \
--return 'distinct l as id, n3 as node1, l.label as label, n1 as node2' \
-o "$TEMP"/Q154.node2.tsv.gz

        2.23 real         0.98 user         0.36 sys
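
The two queries differ only in which end of the claim edge must be one of our beverages. A minimal Python sketch of that selection, using edges that appear in this tutorial's outputs plus a seed set assumed for illustration:

```python
# Entities assumed to be in Q154.descendant.tsv for this toy example.
seeds = {"Q1350656", "Q1007164", "Q10210"}

# Claim edges: (id, node1, label, node2).
claims = [
    ("Q1350656-P279-Q1007164-7e3ecba9-0", "Q1350656", "P279", "Q1007164"),
    ("Q10337004-P186-Q10210-c56dd7ce-0", "Q10337004", "P186", "Q10210"),
    ("Q31-P35-Q12967-f2b9aaf3-0", "Q31", "P35", "Q12967"),
]

# Outgoing edges: node1 is a seed. Incoming edges: node2 is a seed.
outgoing = [edge for edge in claims if edge[1] in seeds]
incoming = [edge for edge in claims if edge[3] in seeds]

print("outgoing:", [edge[0] for edge in outgoing])
print("incoming:", [edge[0] for edge in incoming])
```

An edge between two seeds, such as the first one here, shows up in both lists, which is why duplicates need to be removed when the files are combined.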


Here is a sample of the edges we are getting

In [27]:
!zcat < "$TEMP"/Q154.node2.tsv.gz | head | column -t -s $'\t'

id                                  node1      label  node2
Q1350656-P279-Q1007164-7e3ecba9-0   Q1350656   P279   Q1007164
Q20713240-P279-Q1007164-b3112260-0  Q20713240  P279   Q1007164
Q2535077-P279-Q1007164-b2d3684b-0   Q2535077   P279   Q1007164
Q536976-P279-Q1007164-8bf7467b-0    Q536976    P279   Q1007164
Q7991845-P279-Q1007164-18bc383a-0   Q7991845   P279   Q1007164
Q10337004-P186-Q10210-c56dd7ce-0    Q10337004  P186   Q10210
Q10429117-P31-Q10210-d342f061-0     Q10429117  P31    Q10210
Q1051699-P279-Q10210-65d32c67-0     Q1051699   P279   Q10210
Q1058259-P279-Q10210-e204554a-0     Q1058259   P279   Q10210
zcat: error writing to output: Broken pipe


Concatenate the incoming and outgoing edges to put them in a single file:

In [28]:
!$kgtk cat -i "$TEMP"/Q154.node1.tsv.gz -i "$TEMP"/Q154.node2.tsv.gz -o "$TEMP"/Q154.claims.tsv.gz

        1.23 real         1.10 user         0.10 sys


We have about 28,000 edges:

In [29]:
!zcat < "$TEMP"/Q154.claims.tsv.gz | wc

   28142  116045 1584824


Summary of where we are:
- Computed the list of entities below alcoholic beverage
- Found all incoming and outgoing edges to these entities; for the new entities we bring in, we have no information, we only have the q-node

Not having any information about the entities connected to the alcoholic beverages is limiting, so let's get their outgoing edges. We run the query with `Q154.claims.tsv.gz`, which will use all the entities in our graph, including the alcoholic beverages for which we already got outgoing edges; no harm done, as we can eliminate duplicates later.

In [30]:
!$kypher -i "$CLAIMS" -i "$TEMP"/Q154.claims.tsv.gz \
--match 'Q154: ()-[]->(n1), claims: (n1)-[l]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.hop.out.tsv.gz

        5.75 real         3.92 user         0.57 sys


As a sanity check, let's take a peek:

In [31]:
!zcat < "$TEMP"/Q154.hop.out.tsv.gz | head | column -t -s $'\t'

id                             node1  label  node2
Q1000-P1036-9bef62-f77ac5cf-0  Q1000  P1036  "2--6721"
Q1000-P1081-0d345f-3a33abf5-0  Q1000  P1081  +0.641
Q1000-P1081-0d345f-6da37c02-0  Q1000  P1081  +0.641
Q1000-P1081-1100e3-c7631769-0  Q1000  P1081  +0.624
Q1000-P1081-1ada51-7c71c229-0  Q1000  P1081  +0.639
Q1000-P1081-345681-88a99cab-0  Q1000  P1081  +0.702
Q1000-P1081-347db1-da0e5e03-0  Q1000  P1081  +0.637
Q1000-P1081-419245-b03a8b59-0  Q1000  P1081  +0.647
Q1000-P1081-419245-f8cd58e8-0  Q1000  P1081  +0.647
zcat: error writing to output: Broken pipe


Let's consolidate our edge files into one larger file. We use `compact` to remove duplicates and `sort2` to keep edges for the same subject together:

In [32]:
!$kgtk cat -i "$TEMP"/Q154.claims.tsv.gz -i "$TEMP"/Q154.hop.out.tsv.gz \
/ compact \
/ sort2 \
-o "$TEMP"/Q154.edges.1.tsv.gz

        5.07 real         7.09 user         0.59 sys
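
The consolidation step amounts to a union, a de-duplication and a sort. The Python sketch below shows that idea on toy edges taken from this tutorial. It makes the simplifying assumption that two edges with the same `id` are the same edge and keeps the first one seen; the actual `compact` command may do more than this. `consolidate` is a hypothetical helper name.

```python
def consolidate(*edge_lists):
    """Union several edge lists, keep one edge per id, and sort by id."""
    unique = {}
    for edges in edge_lists:
        for edge in edges:
            unique.setdefault(edge[0], edge)  # edge[0] is the id column
    return sorted(unique.values(), key=lambda edge: edge[0])


# Edges: (id, node1, label, node2). The first file's edge repeats in the second.
claims_file = [
    ("Q1350656-P279-Q1007164-7e3ecba9-0", "Q1350656", "P279", "Q1007164"),
]
hop_file = [
    ("Q1350656-P279-Q1007164-7e3ecba9-0", "Q1350656", "P279", "Q1007164"),
    ("Q1000-P1036-9bef62-f77ac5cf-0", "Q1000", "P1036", '"2--6721"'),
]

merged = consolidate(claims_file, hop_file)
for edge in merged:
    print("\t".join(edge))
```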


Now we have about 165,000 edges:

In [33]:
!zcat < "$TEMP"/Q154.edges.1.tsv.gz | wc

  165133  678398 8868474


Take a peek:

In [34]:
!zcat < "$TEMP"/Q154.edges.1.tsv.gz | head | column -t -s $'\t'

id                                node1  label  node2
P1389-P1855-Q1109662-9e2ef218-0   P1389  P1855  Q1109662
P1582-P1855-Q17329207-f4ef508d-0  P1582  P1855  Q17329207
P2581-P1855-Q7639844-08b3a4c7-0   P2581  P1855  Q7639844
P2665-P1855-Q1067702-402a80a9-0   P2665  P1855  Q1067702
P2665-P1855-Q170210-30d44f0b-0    P2665  P1855  Q170210
P5420-P1855-Q44-209cffb1-0        P5420  P1855  Q44
P5420-P1855-Q722338-73d7be75-0    P5420  P1855  Q722338
P6088-P1855-Q1543214-3d934541-0   P6088  P1855  Q1543214
P6088-P1855-Q4626-4ed65964-0      P6088  P1855  Q4626
zcat: error writing to output: Broken pipe


Once we have all the alcoholic beverages, we want to get the upper ontology of all the classes used, so that every class in our KG has a path to the root of the ontology. For example, first go to `drink` (`Q40050`), then to `liquid` (`Q11435`), then `fluid` (`Q102205`) and so on until we reach `entity` (`Q35120`).

To do this, we need to get all the `isa` of all items in our graph, then get `P279star` so we get the list of all classes that these items descend from. Finally we need to get all the `P279` edges between them.

In [35]:
!$kypher -i "$TEMP"/Q154.edges.1.tsv.gz -i "$P279STAR" -i "$ISA" \
--match 'Q154: (n1)-[]->(), isa: (n1)-[]->(n2), P279: (n2)-[]->(class)' \
--return 'distinct class as node1' \
-o "$TEMP"/Q154.classes.tsv

       13.58 real         9.23 user         1.18 sys


We have almost 3,000 classes in the upper ontology for the entities in our graph:

In [36]:
!wc "$TEMP"/Q154.classes.tsv

    2846    2846   24939 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.classes.tsv


Now use the `derived.P279.tsv` file to get the `P279` edges that connect a class to its superclass.

In [37]:
!$kypher -i "$TEMP"/Q154.classes.tsv -i "$P279" \
--match 'Q154: (class)-[]->(), P279: (class)-[l]->(super)' \
--return 'distinct l as id, class as node1, l.label as label, super as node2' \
-o "$TEMP"/Q154.P279.tsv

        1.48 real         0.89 user         0.22 sys
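
The two queries for the upper ontology can be summarized in Python: first collect every class reachable from our items through `isa` followed by `P279star`, then keep the direct `P279` edges that start at one of those classes. The toy data below is assumed for illustration (the item's `isa` edge and the hierarchy are not checked against Wikidata), and the edge ids of the real files are omitted.

```python
# Items in our graph (node1 values of the edge file). Toy example.
items = {"Q1069954"}

# isa edges: (item, class). Assumed for illustration.
isa = [("Q1069954", "Q154")]

# P279star edges: (class, ancestor), including the zero-length chain.
star = {
    ("Q154", "Q154"),
    ("Q154", "Q40050"),
    ("Q154", "Q11435"),
    ("Q154", "Q102205"),
}

# Direct P279 edges: (class, superclass). The last one is unrelated.
p279 = [
    ("Q154", "Q40050"),
    ("Q40050", "Q11435"),
    ("Q11435", "Q102205"),
    ("Q6256", "Q1048835"),
]

# Query 1: item -isa-> n2 -P279star-> class.
direct_classes = {cls for item, cls in isa if item in items}
classes = {ancestor for cls, ancestor in star if cls in direct_classes}

# Query 2: P279 edges whose node1 is one of the collected classes.
upper_ontology = [(cls, sup) for cls, sup in p279 if cls in classes]

print(sorted(classes))
print(upper_ontology)
```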


We get about 4,500 `P279` edges in the upper ontology; we will take care of potential duplicates in a final cleanup step:

In [38]:
!wc "$TEMP"/Q154.P279.tsv

    4517   18068  249492 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.P279.tsv


We see several q-nodes below `entity` (`Q35120`), a good indication that we computed the upper ontology correctly:

In [39]:
!grep Q35120 "$TEMP"/Q154.P279.tsv | head -5 | column -t -s $'\t'

Q16686448-P279-Q35120-674edbf9-0  Q16686448  P279  Q35120
Q35120-P279-25b964-0520e300-0     Q35120     P279  novalue
Q58415929-P279-Q35120-75659d0c-0  Q58415929  P279  Q35120
Q23958946-P279-Q35120-70a9ed90-0  Q23958946  P279  Q35120
Q488383-P279-Q35120-5fad2ad7-0    Q488383    P279  Q35120


Let's consolidate the edges again:

In [40]:
!$kgtk cat -i "$TEMP"/Q154.edges.1.tsv.gz -i "$TEMP"/Q154.P279.tsv \
/ compact \
/ sort2 \
-o "$TEMP"/Q154.edges.2.tsv.gz

        5.14 real         7.12 user         0.57 sys


We have about 169,000 edges:

In [41]:
!zcat < "$TEMP"/Q154.edges.2.tsv.gz | wc

  169047  694054 9085731


Summary:
- We have the instances of alcoholic beverages
- We added incoming and outgoing edges
- For the outgoing edges, we went one hop forward
- We got the upper ontology

The properties are also items in Wikidata, so let's collect them all and get their edges.

In [42]:
!$kypher -i "$TEMP"/Q154.edges.2.tsv.gz \
--match '()-[l {label: property}]->()' \
--return 'distinct property as node1' \
-o "$TEMP"/Q154.properties.tsv

        2.13 real         2.03 user         0.31 sys


In [43]:
!head "$TEMP"/Q154.properties.tsv | column -t -s $'\t'

node1
P10
P1001
P1003
P1004
P1005
P1006
P101
P1014
P1015


Let's get the edges of these properties:

In [44]:
!$kypher -i "$CLAIMS" -i "$TEMP"/Q154.properties.tsv \
--match 'Q154: (p)-[]->(), claims: (p)-[l]->(n2)' \
--return 'distinct l as id, p as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.properties.edges.tsv

        1.25 real         0.91 user         0.18 sys


Take a peek. The output looks like what we saw before because the file is sorted, so let's proceed:

In [45]:
!head "$TEMP"/Q154.properties.edges.tsv | column -t -s $'\t'

id                              node1  label  node2
P10-P1628-32b85d-7927ece6-0     P10    P1628  "http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0     P10    P1628  "https://schema.org/video"
P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508
P10-P1659-P1651-c4068028-0      P10    P1659  P1651
P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18
P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238
P10-P1659-P51-86aca4c5-0        P10    P1659  P51
P10-P1855-Q15075950-7eff6d65-0  P10    P1855  Q15075950
P10-P1855-Q69063653-c8cdb04c-0  P10    P1855  Q69063653


Let's consolidate the edges again:

In [46]:
!$kgtk cat -i "$TEMP"/Q154.edges.2.tsv.gz -i "$TEMP"/Q154.properties.edges.tsv \
/ compact \
/ sort2 \
-o "$TEMP"/Q154.edges.3.tsv.gz

        6.18 real         8.48 user         0.65 sys


The number of edges grew a bit, to about 197,500:

In [47]:
!zcat < "$TEMP"/Q154.edges.3.tsv.gz | wc

  197521  811687 10791930


Summary:
- We have the instances of alcoholic beverages
- We added incoming and outgoing edges
- For the outgoing edges, we went one hop forward
- We got the upper ontology
- And we have the edges on all the properties being used

We will stop adding nodes to the KG at this time, and proceed to add the labels for all the nodes.

### Step 3: get the labels, aliases and descriptions of all the items in our KG
Before we start, let's define an environment variable to hold the final edges file so that if we change our mind later, we can update it without having to change the commands below.

In [48]:
os.environ["Q154GRAPH"] = os.environ["TEMP"] + "/Q154.edges.3.tsv.gz"

In [49]:
!ls "$Q154GRAPH"

/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/Q154.edges.3.tsv.gz


Get the labels of the `node1` nodes

In [50]:
!$kypher -i "$Q154GRAPH" -i "$LABEL" \
--match 'Q154: (n1)-[]-(), label: (n1)-[l]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.label.node1.tsv.gz

        5.02 real         2.81 user         0.87 sys


In [51]:
!zcat < "$TEMP"/Q154.label.node1.tsv.gz | head | column -t -s $'\t'

id              node1  label  node2
P10-label-en    P10    label  'video'@en
P1001-label-en  P1001  label  'applies to jurisdiction'@en
P1003-label-en  P1003  label  'National Library of Romania ID'@en
P1004-label-en  P1004  label  'MusicBrainz place ID'@en
P1005-label-en  P1005  label  'Portuguese National Library ID'@en
P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en
P101-label-en   P101   label  'field of work'@en
P1014-label-en  P1014  label  'Getty AAT ID'@en
P1015-label-en  P1015  label  'NORAF ID'@en
zcat: error writing to output: Broken pipe


Get the labels of the `node2` nodes

In [52]:
!$kypher -i "$Q154GRAPH" -i "$LABEL" \
--match 'Q154: ()-[]-(n2), label: (n2)-[l]->(n3)' \
--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \
-o "$TEMP"/Q154.label.node2.tsv.gz

        8.45 real         2.05 user         1.71 sys


Concatenate the two label files

In [53]:
!$kgtk cat -i "$TEMP"/Q154.label.node1.tsv.gz -i "$TEMP"/Q154.label.node2.tsv.gz \
-o "$TEMP"/labels.tsv.gz

        1.66 real         1.52 user         0.10 sys


In [54]:
!zcat < "$TEMP"/labels.tsv.gz | wc

   56123  289814 3031029


Get the aliases of `node1` nodes

In [55]:
!$kypher -i "$Q154GRAPH" -i "$ALIAS" \
--match 'Q154: (n1)-[]-(), alias: (n1)-[l]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.alias.node1.tsv.gz

        2.55 real         1.51 user         0.37 sys


Get the aliases of `node2` nodes

In [56]:
!$kypher -i "$Q154GRAPH" -i "$ALIAS" \
--match 'Q154: ()-[]-(n2), alias: (n2)-[l]->(n3)' \
--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \
-o "$TEMP"/Q154.alias.node2.tsv.gz

        3.44 real         1.59 user         0.59 sys


Concatenate the two alias files

In [57]:
!$kgtk cat -i "$TEMP"/Q154.alias.node1.tsv.gz -i "$TEMP"/Q154.alias.node2.tsv.gz \
-o "$TEMP"/alias.tsv.gz

        1.63 real         1.49 user         0.11 sys


Get the descriptions of `node1` nodes

In [58]:
!$kypher -i "$Q154GRAPH" -i "$DESCRIPTION" \
--match 'Q154: (n1)-[]-(), description: (n1)-[l]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.description.node1.tsv.gz

        3.09 real         1.11 user         0.52 sys


Get the descriptions of `node2` nodes

In [59]:
!$kypher -i "$Q154GRAPH" -i "$DESCRIPTION" \
--match 'Q154: ()-[]-(n2), description: (n2)-[l]->(n3)' \
--return 'distinct l as id, n2 as node1, l.label as label, n3 as node2' \
-o "$TEMP"/Q154.description.node2.tsv.gz

        8.51 real         1.94 user         1.70 sys


Concatenate the two description files

In [60]:
!$kgtk cat -i "$TEMP"/Q154.description.node1.tsv.gz -i "$TEMP"/Q154.description.node2.tsv.gz \
-o "$TEMP"/description.tsv.gz

        1.67 real         1.48 user         0.11 sys
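The six queries in this step follow one pattern, varying only the metadata file and which end of the edge is matched. Here is a minimal sketch that generates the command strings (it only prints them; it does not run Kypher). The helper name `metadata_commands` is hypothetical, and the strings simply reproduce the commands shown above:

```python
# (graph alias used in --match, environment variable holding the input file)
METADATA = [("label", "LABEL"), ("alias", "ALIAS"), ("description", "DESCRIPTION")]

# (output suffix, Q154 pattern, node whose metadata we fetch, metadata value variable)
ENDS = [("node1", "(n1)-[]-()", "n1", "n2"), ("node2", "()-[]-(n2)", "n2", "n3")]


def metadata_commands():
    """Build the kypher command lines used in Step 3, one per (kind, end) pair."""
    commands = []
    for kind, env_var in METADATA:
        for suffix, pattern, node, value in ENDS:
            match = "Q154: {}, {}: ({})-[l]->({})".format(pattern, kind, node, value)
            ret = "distinct l as id, {} as node1, l.label as label, {} as node2".format(node, value)
            commands.append(
                '$kypher -i "$Q154GRAPH" -i "${}" --match \'{}\' --return \'{}\' '
                '-o "$TEMP"/Q154.{}.{}.tsv.gz'.format(env_var, match, ret, kind, suffix)
            )
    return commands


for command in metadata_commands():
    print(command)
```

Generating the commands from one template makes it harder for the `node1` and `node2` variants to drift apart when a query is edited.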


### Step 4: get the qualifiers

In [61]:
!$kypher -i "$Q154GRAPH" -i "$QUALIFIERS" \
--match 'Q154: ()-[l]->(), qual: (l)-[lq]->(n2)' \
--return 'lq as id, l as node1, lq.label as label, n2 as node2' \
-o "$OUT"/qualifiers.tsv.gz

        5.29 real         2.44 user         0.73 sys


In [62]:
!zcat < "$OUT"/qualifiers.tsv.gz | head | column -t -s $'\t'

zcat: error writing to output: Broken pipe
id                                                node1                           label  node2
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0       P10-P1855-Q15075950-7eff6d65-0  P10    "Smoorverliefd 12 september.webm"
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0    P10-P1855-Q15075950-7eff6d65-0  P3831  Q622550
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0       P10-P1855-Q69063653-c8cdb04c-0  P10    "Couch Commander.webm"
P10-P1855-Q7378-555592a4-0-P10-8a982d-0           P10-P1855-Q7378-555592a4-0      P10    "Elephants Dream (2006).webm"
P10-P2302-Q21502404-d012aef4-0-P1793-f4c2ed-0     P10-P2302-Q21502404-d012aef4-0  P1793  "(?i).+\\\\.(webm\\|ogv\\|ogg\\|gif)"
P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0  P10-P2302-Q21502404-d012aef4-0  P2316  Q21502408
P10-P2302-Q21502404-d012aef4-0-P2916-cb0917-0     P10-P2302-Q21502404-d012aef4-0  P2916  'filename with extension: webm, ogg, ogv, or gif (case insensitive)'@en
P10-P2302-Q21510851-5224

In [63]:
!zcat < "$OUT"/qualifiers.tsv.gz | wc

  109816  446163 10639203


### Step 5: consolidate all the files

In [64]:
!wget https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv -O "$TEMP"/kgtk.properties.tsv

--2020-12-23 18:28:45--  https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2617 (2.6K) [text/plain]
Saving to: ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’


2020-12-23 18:28:46 (14.4 MB/s) - ‘/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/kgtk.properties.tsv’ saved [2617/2617]



In [65]:
!head "$TEMP"/kgtk.properties.tsv | column -t -s $'\t'

node1     label        node2                                   id
isa       label        "is a"@en                               isa-label-e79b73
isa       alias        "isa"@en                                isa-alias-7773c5
isa       description  "Instance or subclass relationship"@en  isa-description-0b5cdc
isa       P31          Q18616576                               isa-P31-Q18616576
isa       P31          Q28326461                               isa-P31-Q28326461
isa       P31          Q18647519                               isa-P31-Q18647519
isa       data_type    wikibase-item                           isa-data_type-643cc9
P279star  label        "is a"@en                               P279star-label-e79b73
P279star  alias        "isa"@en                                P279star-alias-7773c5


Check the property datatypes file:

In [66]:
!zcat < "$PROPERTY_DATATYPES" | head

id	node1	label	node2
P10-datatype	P10	datatype	commonsMedia
P1000-datatype	P1000	datatype	wikibase-item
P1001-datatype	P1001	datatype	wikibase-item
P1002-datatype	P1002	datatype	wikibase-item
P1003-datatype	P1003	datatype	external-id
P1004-datatype	P1004	datatype	external-id
P1005-datatype	P1005	datatype	external-id
P1006-datatype	P1006	datatype	external-id
P1007-datatype	P1007	datatype	external-id
zcat: error writing to output: Broken pipe


In [67]:
!$kypher -i "$Q154GRAPH" -i "$PROPERTY_DATATYPES" \
--match 'Q154: (n1)-[]->(), property: (n1)-[l:datatype]->(n2)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q154.metadata.property.datatype.tsv.gz

        1.26 real         0.76 user         0.11 sys


In [85]:
!zcat < "$TEMP"/Q154.metadata.property.datatype.tsv.gz | head

id	node1	label	node2
P10-datatype	P10	datatype	commonsMedia
P1001-datatype	P1001	datatype	wikibase-item
P1003-datatype	P1003	datatype	external-id
P1004-datatype	P1004	datatype	external-id
P1005-datatype	P1005	datatype	external-id
P1006-datatype	P1006	datatype	external-id
P101-datatype	P101	datatype	wikibase-item
P1014-datatype	P1014	datatype	external-id
P1015-datatype	P1015	datatype	external-id


In [68]:
!$kgtk cat \
-i "$TEMP"/labels.tsv.gz \
-i "$TEMP"/alias.tsv.gz \
-i "$TEMP"/description.tsv.gz \
-i "$TEMP"/Q154.edges.3.tsv.gz \
-i "$TEMP"/kgtk.properties.tsv \
-i "$TEMP"/Q154.metadata.property.datatype.tsv.gz \
/ compact \
/ sort2 \
-o "$OUT"/all.tsv.gz

        8.95 real        11.87 user         0.73 sys


In [69]:
!$kypher -i "$TEMP"/Q154.edges.3.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

count(DISTINCT graph_35_c1."node1")
13147
        0.92 real         0.79 user         0.10 sys


In [70]:
!zcat < "$OUT"/all.tsv.gz | wc

  346639 1718566 20581359


### Step 6: partition the files to follow the conventions KGTK uses for Wikidata

We'll use the partition-wikidata notebook to complete this step. This notebook expects an input file that includes all edges and qualifiers together. We also need to specify a directory where partitioned files should be created, and a directory where temporary files can be sent (this should be different from our temp directory as the partition notebook will clear any existing files in this folder).

In [71]:
!mkdir $OUT/parts

mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts: File exists
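The `File exists` message above is harmless, but it can be avoided by creating the directories from Python. This is a minimal sketch using only the standard library; `ensure_dirs` is a hypothetical helper, and the commented usage assumes the `OUT` environment variable set up earlier in this tutorial:

```python
import os


def ensure_dirs(*paths):
    """Create each directory (and any missing parents) without failing if it exists."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return list(paths)


# Example (assumes OUT is defined as in this notebook):
# ensure_dirs(os.environ["OUT"] + "/parts", os.environ["OUT"] + "/parts/temp")
```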


In [72]:
!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz

        6.40 real         6.18 user         0.16 sys


In [86]:
!zcat < $TEMP/all_and_qualifiers.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653
zcat: error writing to output: Broken pipe


In [87]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["TEMP"] + "/all_and_qualifiers.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
)


The partition-wikidata notebook created the following partitioned kgtk-files:

In [88]:
!ls $OUT/parts

aliases.en.tsv.gz                  metadata.property.datatypes.tsv.gz
aliases.tsv.gz                     metadata.types.tsv.gz
all.tsv.gz                         qualifiers.tsv.gz
claims.tsv.gz                      sitelinks.en.tsv.gz
descriptions.en.tsv.gz             sitelinks.qualifiers.en.tsv.gz
descriptions.tsv.gz                sitelinks.qualifiers.tsv.gz
labels.en.tsv.gz                   sitelinks.tsv.gz
labels.tsv.gz                      temp


In [75]:
!$kypher -i $OUT/parts/claims.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

count(DISTINCT graph_36_c1."node1")
13153
        2.61 real         2.55 user         0.37 sys


# Embeddings

## Graph Embeddings

Normally we would use `Q154ITEM`, but the partitioning failed to produce that file (see the error below), so we will compute the equivalent file using kypher.

In [78]:
!zcat < "$Q154ITEM" | head

/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz: No such file or directory


In [100]:
!zcat < "$TEMP"/Q154.edges.3.tsv.gz | wc

  197521  811687 10791930


In [118]:
!$kypher -i "$TEMP"/Q154.edges.3.tsv.gz -i "$TEMP"/Q154.metadata.property.datatype.tsv.gz -i "$Q154LABEL" \
--match 'edges: (n1)-[l {label: property}]->(n2), datatype: (property)-[]->(dt:`wikibase-item`), label: (n1)-[]->(lab)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$GE"/geinput.tsv

        0.83 real         0.66 user         0.15 sys


We have over 60,000 lines:

In [6]:
!wc "$GE"/geinput.tsv

   66490  265960 3297462 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv


Compute the graph embeddings using the `translation` operator. Our output file `embeddings.txt` will be in word2vec format (`-ot w2v`), so we can use it directly in gensim.

In [161]:
!$kgtk graph-embeddings --verbose -i "$GE"/geinput.tsv \
-o "$GE"/embeddings.txt \
--retain_temporary_data True \
--operator translation \
--workers 5 \
--log "$GE"/ge.log \
-T "$GE" \
-ot w2v \
-e 300

In Processing, Please go to /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/ge.log to check details
Opening the input file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
header: id	node1	label	node2
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Opening the output file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
File_path.suffix: .tsv
KgtkWriter: writing file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
header: id	node1	label	node2
Processing the input records.
Processed 66489 records.
Processed Finished.
      193.64 real       958.24 user       107.56 sys


Let's look at the output directory:

In [79]:
!ls -hl "$GE"

total 446864
-rw-r--r--   1 pedroszekely  staff   101K Dec 26 16:09 Q27.sim.tsv
-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:18 Q27.tsv
-rw-r--r--   1 pedroszekely  staff   177K Dec 26 16:09 Q29.Q45.Q142.sim.tsv
-rw-r--r--   1 pedroszekely  staff    43K Dec 25 22:36 Q29.Q45.sim.tsv
-rw-r--r--   1 pedroszekely  staff    85K Dec 26 16:09 Q29.sim.tsv
-rw-r--r--   1 pedroszekely  staff    79K Dec 26 16:09 Q332378.sim.tsv
-rw-r--r--   1 pedroszekely  staff    88K Dec 26 16:09 Q374.sim.tsv
-rw-r--r--   1 pedroszekely  staff    87K Dec 26 16:09 Q502268.sim.tsv
-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:11 Q502268.tsv
-rw-r--r--   1 pedroszekely  staff   4.3K Dec 25 21:33 Q610672.tsv
-rw-r--r--   1 pedroszekely  staff    53M Dec 23 23:23 embeddings.txt
-rw-r--r--   1 pedroszekely  staff   480K Dec 23 23:23 ge.log
-rw-r--r--   1 pedroszekely  staff   3.1M Dec 23 22:02 geinput.tsv
-rw-r--r--   1 pedroszekely  staff   973K Dec 23 12:41 geinput.tsv.gz
drwxr-xr-x  10 pedroszekely  s

Let's peek at the file: we have about 44K vectors of dimension 100.

In [80]:
!head -2 "$GE"/embeddings.txt

44419 100
Q243611 -0.331411451 -0.152568206 -0.139386058 -0.121394955 -0.334799886 0.023394363 -0.024942441 -0.137579590 0.084599547 0.876167953 -0.222018719 -0.168754980 -0.027932534 -0.289450347 0.250572681 -0.633476973 -0.440892249 -0.178823337 0.299026161 -0.407618254 -0.036977571 0.032356881 -0.081695572 -0.055025205 -0.182957411 -0.250380307 0.535348237 -0.108279251 0.452128828 -0.346319675 0.042611640 0.338040203 0.171208084 -0.275558919 0.114576176 -0.198427215 -0.277292132 -0.149741501 -0.327517658 0.146066576 0.431715995 0.481242269 -0.124767415 -0.171481445 -0.394009471 -0.305026233 0.223357961 0.360154629 0.213194653 0.012373813 -0.405227572 0.052000813 0.084122777 0.072465442 0.241527051 0.314641565 -0.258469820 0.122197300 -0.385967076 -0.472052187 -0.090907939 -0.102187648 0.184509873 0.132856295 0.402841479 0.585462868 0.695401728 0.060416430 -0.322626084 -0.238338873 0.333650321 0.479767382 -0.366145641 0.051905960 0.275238752 0.429640323 -0.370602965 0.055560533 0.609
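The first line of a word2vec text file holds the number of vectors and their dimension, and each following line holds a key followed by its values. Here is a minimal sketch that checks this without loading the whole file; `peek_w2v` is a hypothetical helper name:

```python
def peek_w2v(path):
    """Return (vector_count, dimension, first_key) and check the first vector's length."""
    with open(path, "r", encoding="utf-8") as handle:
        count, dimension = (int(x) for x in handle.readline().split())
        first = handle.readline().split()
    key, values = first[0], first[1:]
    if len(values) != dimension:
        raise ValueError(
            "header says {} dimensions but first vector has {}".format(dimension, len(values))
        )
    return count, dimension, key
```

Calling `peek_w2v(os.environ["GE"] + "/embeddings.txt")` is a cheap sanity check to run before handing the file to gensim.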

Load the vectors in gensim

In [81]:
path = os.environ['GE'] + "/embeddings.txt"
ge_vectors = KeyedVectors.load_word2vec_format(path, binary=False)

In [82]:
# Q502268 is Johnnie Walker
ge_vectors['Q502268']

array([-0.71844614, -0.72041976,  0.819834  , -0.07249352,  0.24403723,
        0.60705996, -0.5666862 , -0.5559557 ,  0.686424  ,  0.6667965 ,
       -0.46009716,  0.4207767 , -0.17946522, -0.18458156, -1.0764353 ,
        1.056981  , -0.06046142,  0.00866301, -0.02163753, -0.3418129 ,
       -0.03871485, -0.14953642,  0.8018838 ,  0.19381396, -0.10066328,
        0.884025  , -0.08962934, -0.36985362, -0.3394345 ,  0.671762  ,
        0.11509704, -0.6489555 , -0.22910565, -0.6392556 ,  0.8204702 ,
       -0.260422  ,  0.4548083 ,  0.06683284, -0.09605702,  0.23433112,
        0.4129733 ,  0.05630195, -0.24607319, -0.19756897,  0.3878965 ,
        0.08242382,  0.07034106,  0.14290804,  0.07523334, -0.16040339,
        0.02874546, -0.0554648 ,  0.00764391, -0.6856189 , -0.3701922 ,
       -0.23979117,  0.26580626,  0.01087183, -1.2511953 ,  0.01297893,
       -0.23593499, -0.16515297, -0.2442124 , -0.10745924,  1.16383   ,
       -0.8887456 ,  0.7308084 , -0.02755331,  1.395485  , -0.34

Find the most similar qnodes to `Q15874936`, the qnode for Michelob.

In [83]:
ge_vectors.most_similar(positive=['Q15874936'], topn=5)

[('Q610672', 0.9267997741699219),
 ('Q48799234', 0.7637178897857666),
 ('Q85269976', 0.762772262096405),
 ('Q5647008', 0.7582801580429077),
 ('Q5149389', 0.7565429210662842)]
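The scores returned by `most_similar` are cosine similarities between vectors. Here is a minimal pure-Python sketch of that ranking on toy data (the three 2-dimensional vectors and their keys are made up for illustration; gensim performs the computation far more efficiently with NumPy):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=5):
    """Rank every other key by cosine similarity to `key`, best first."""
    target = vectors[key]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (hypothetical keys and values, not taken from embeddings.txt)
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(most_similar(toy, "a", topn=2))
```

A score close to 1 means the two vectors point in nearly the same direction, which is why Budweiser (0.93) stands out from the other matches for Michelob.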

This is hard to use because the results are qnodes and we have no idea what they are. Let's define a function to fetch the labels and descriptions so that we can interpret the results more easily.

`kgtk_most_similar` is a wrapper around gensim's `most_similar` function, designed to output the results in KGTK format. The `kg_path` argument is required if we want to output the labels and descriptions, as this path is where the `labels.en.tsv.gz` and `descriptions.en.tsv.gz` files are stored. You can optionally provide an `output_path` to tell it to store the results in a file; otherwise the results are returned as a dataframe.

In [84]:
def kgtk_most_similar(
    vectors,
    positive,
    relation_label="similarity_score",
    kg_path=None,
    add_label_description=True,
    output_path=None,
    topn=25,
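):
    """Wrap gensim's `most_similar` and return the results in KGTK format.

    If `kg_path` is given and `add_label_description` is true, labels and
    descriptions are looked up in `labels.en.tsv.gz` and `descriptions.en.tsv.gz`
    under `kg_path`. Results are written to `output_path` if given; otherwise
    they are returned as a pandas DataFrame.
    """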
    result = []
    if add_label_description and kg_path:
        fp = tempfile.NamedTemporaryFile(
            mode="w", suffix=".tsv", delete=False, encoding="utf-8"
        )
        fp.write("node1\tlabel\tnode2\n")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            fp.write("{}\t{}\t{}\n".format(qnode, relation_label, similarity))
        filename = fp.name
        fp.close()

        os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
        os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
        os.environ["_temp_file"] = filename

        result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_temp_file" --as sim \
--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, lab as `node1;label`, des as `node1;description`' \
--order-by 'cast(similarity, float) desc' 
        
        os.remove(filename)
        
    else:
        # Build the rows without trailing newlines so that both output paths
        # below (file and dataframe) see clean column values.
        result.append("node1\tlabel\tnode2")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            result.append("{}\t{}\t{}".format(qnode, relation_label, similarity))

    if output_path:
        with open(output_path, "w") as handle:
            for line in result:
                handle.write(line)
                handle.write("\n")
    else:
        columns = result[0].split("\t")
        data = []
        for line in result[1:]:
            data.append(line.split("\t"))
        return pd.DataFrame(data, columns=columns)

Let's give it a try:

In [85]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.926799774169922,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q48799234,0.7637178897857666,similarity,'Virginia Black Whiskey'@en,'super-premium brand of American Bourbon whisk...
2,Q85269976,0.762772262096405,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
3,Q5149389,0.7565429210662842,similarity,'Colt 45'@en,'malt liquor'@en
4,Q3079990,0.752647340297699,similarity,'Four Loko'@en,'Drink'@en
5,Q96952363,0.7438719272613525,similarity,'Cronk'@en,'American drink'@en
6,Q7085533,0.7436875104904175,similarity,'Olde English 800'@en,'malt liquor'@en


## Text embeddings

In [9]:
!zcat < $OUT/all.tsv.gz | head -500 > $TEMP/all.500.tsv

zcat: error writing to output: Broken pipe


In [12]:
!head $TEMP/all.500.tsv

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653


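Compute the text embeddings. For each node, the command builds a sentence from its label, its description, its `isa` properties (`P31`, `P279`, `P452`, `P106`) and the values of selected properties (`P186`, `P17`, `P127`, `P176`, `P169`), and encodes that sentence with the `bert-large-nli-cls-token` model. The embeddings are written in KGTK format to `text-embedding.tsv`.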

In [None]:
!$kgtk text-embedding -i $OUT/all.tsv.gz \
--embedding-projector-metadata-path none \
--label-properties label \
--isa-properties P31 P279 P452 P106 \
--description-properties description \
--property-value P186 P17 P127 P176 P169 \
--has-properties "" \
-f kgtk_format \
--output-data-format kgtk_format \
--save-embedding-sentence \
--model bert-large-nli-cls-token \
-o "$TE" \
> "$TE"/text-embedding.tsv

Duration --parallel 1
16348.11 real     16066.21 user       315.45 sys

The text embeddings are output in KGTK format, but we need them in word2vec format (the command should eventually be enhanced to produce w2v format directly). For now, we define a function to convert the KGTK embeddings to w2v format.

In [110]:
def convert_kgtk_to_w2v(input_path, output_path, text_embedding_label="text_embedding"):
    """
    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format
    """
    vector_count = 0
    vector_length = 0

    # First pass: count the vectors and get their length, because the w2v format
    # requires both numbers on the first line of the file
    with open(input_path, "r") as kgtk_file:
        next(kgtk_file)  # skip the KGTK header
        for line in kgtk_file:
            items = line.rstrip("\n").split("\t")
            if items[1] == text_embedding_label:
                if vector_count == 0:
                    vector_length = len(items[2].split(","))
                vector_count += 1

    # Second pass: write the header line followed by one "qnode v1 v2 ..." line per vector
    with open(output_path, "w") as w2v_file, open(input_path, "r") as kgtk_file:
        w2v_file.write("{} {}\n".format(vector_count, vector_length))
        next(kgtk_file)  # skip the KGTK header
        for line in kgtk_file:
            items = line.rstrip("\n").split("\t")
            if items[1] == text_embedding_label:
                vector = items[2].replace(",", " ")
                w2v_file.write(items[0] + " " + vector + "\n")

In [111]:
convert_kgtk_to_w2v(os.environ['TE'] + "/text-embedding.tsv", os.environ['TE'] + "/embeddings.txt")
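To make the format change concrete, here is a sketch of what the conversion does to a single row, assuming (as the function above does) that the vector is stored in `node2` as comma-separated numbers. `kgtk_row_to_w2v` is a hypothetical helper and the sample row is made up:

```python
def kgtk_row_to_w2v(row, text_embedding_label="text_embedding"):
    """Turn one KGTK line (node1, label, node2) into a w2v line, or None if it is not an embedding."""
    items = row.rstrip("\n").split("\t")
    if len(items) < 3 or items[1] != text_embedding_label:
        return None
    # w2v separates the key and the vector components with single spaces
    return items[0] + " " + items[2].replace(",", " ")


# Hypothetical sample row, shortened to three dimensions
print(kgtk_row_to_w2v("Q1\ttext_embedding\t0.1,0.2,0.3\n"))
```

Rows with any other label, such as the embedding sentences saved by `--save-embedding-sentence`, are skipped, which is why the conversion needs a first pass to count only the embedding rows.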

Let's look at the output file. The embeddings have 1024 dimensions:

In [146]:
!head -2 "$TE"/embeddings.txt

56017 1024
undirected_pagerank -0.42267796 0.3995441 0.5533569 -0.71286017 0.35639343 0.23904479 -0.2763573 0.37157294 -0.4283453 1.3224101 0.6862846 0.19590487 -0.6082015 -0.11240994 0.33890438 -0.20922732 -0.23069456 -0.021294963 -1.912606 0.49719235 0.6929876 0.011938913 -1.5600294 0.20473605 -0.17875122 0.45237 -0.09061487 0.0838695 0.039139077 -0.5781012 -0.2535121 0.065458305 -0.34608266 -0.42478928 -0.4474916 -0.23409875 -0.13160512 -0.076800026 -0.6984711 0.12516521 -0.42880625 -0.85138726 0.04815936 -0.6207587 -0.08866266 -1.6658425 -0.51067406 -0.34878105 0.33144328 -0.69933593 -0.36479193 -0.6388813 0.76048696 0.12395467 -0.88557744 0.34427696 1.2574033 -0.65131736 -0.9506962 0.6257681 0.36623836 0.716814 0.36953598 -1.3571995 0.2660646 -1.2076085 0.09180403 -0.36115 0.42118248 -0.92440283 -0.32160524 -0.14557533 -0.50016695 -0.12131537 -0.74813855 0.5254087 0.42912796 -0.73770857 -0.39519224 1.1647401 0.63930184 -0.33095387 -0.17238976 0.19148383 -0.31919938 -0.7583614 0.15

Load the text embeddings in gensim

In [148]:
te_path = os.environ['TE'] + "/embeddings.txt"
te_vectors = KeyedVectors.load_word2vec_format(te_path, binary=False)

### Compare the graph and text embeddings

Most similar nodes to Johnnie Walker using the **graph embeddings**

In [86]:
# Q502268 is Johnnie Walker
kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q4865371,0.833085298538208,similarity,'Bartlet for America'@en,'episode of The West Wing (S3 E9)'@en
1,Q7084279,0.8258047103881836,similarity,'Old Ironsides'@en,'1926 film by James Cruze'@en
2,Q7736602,0.8078582286834717,similarity,'The Girl of the Golden West'@en,'1930 film by John Francis Dillon'@en
3,Q1799948,0.8060345649719238,similarity,'Ladies of Leisure'@en,'1930 film by Frank Capra'@en
4,Q2288328,0.8006598949432373,similarity,'The Matinee Idol'@en,"'1928 film by Walt Disney, Frank Capra'@en"
5,Q628737,0.7132620811462402,similarity,'Campbeltown Single Malts'@en,'single malt Scotch whiskies distilled in the ...
6,Q280,0.6832661032676697,similarity,'Lagavulin Distillery'@en,"'Scotch whisky distillery in Lagavulin, Islay,..."
7,Q1761185,0.6419662237167358,similarity,'Pimm\\\\'s'@en,'alcohol brand'@en
8,Q96278979,0.6371052861213684,similarity,'Lagavulin 16 years whisky'@en,'Lagavulin 16 years single malt scotch whisky'@en


Most similar nodes to Johnnie Walker using the **text embeddings**

In [150]:
# Q502268 is Johnnie Walker
kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q280,0.9379171133041382,similarity,'Lagavulin Distillery'@en,"'Scotch whisky distillery in Lagavulin, Islay,..."
1,Q2490031,0.9346836805343628,similarity,'William Grant & Sons'@en,'Scottish company which distills Scotch whisky...
2,Q1543646,0.9012988805770874,similarity,'Rob Roy'@en,'cocktail based on Scotch whisky'@en
3,Q2168523,0.8907997012138367,similarity,'The Famous Grouse'@en,'brand of Scotch whisky'@en
4,Q1069502,0.8856703042984009,similarity,'Chivas Regal'@en,'Blended Scotch Whisky produced by Chivas Brot...
5,Q4821838,0.8762272596359253,similarity,'Aultmore distillery'@en,"'whisky distillery in Moray, Scotland, UK'@en"
6,Q4720319,0.8761684894561768,similarity,'Alexander Walker'@en,'Scottish whisky distiller'@en
7,Q1754978,0.8664095401763916,similarity,'Rusty Nail'@en,'cocktail mixing Drambuie and Scotch whisky'@en
8,Q42032478,0.8583760857582092,similarity,'Tiree Whisky Company'@en,'company that sells whisky on the island of Ti...
9,Q20031443,0.8488548994064331,similarity,'Something Special'@en,'blended Scotch whisky'@en


The graph embeddings produce poor results as the top matches are not related to whiskey. The text embeddings look much better.
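One rough way to quantify how differently the two embeddings behave is to measure the overlap between their result lists. This is a minimal sketch using the qnodes from the two tables above; the Jaccard measure and the helper name `jaccard` are our choice for illustration, not something KGTK provides:

```python
def jaccard(first, second):
    """Size of the intersection divided by the size of the union of two result lists."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Qnodes returned for Johnnie Walker (Q502268) in the tables above
ge_top = ["Q4865371", "Q7084279", "Q7736602", "Q1799948", "Q2288328",
          "Q628737", "Q280", "Q1761185", "Q96278979"]
te_top = ["Q280", "Q2490031", "Q1543646", "Q2168523", "Q1069502",
          "Q4821838", "Q4720319", "Q1754978", "Q42032478", "Q20031443"]

print(sorted(set(ge_top) & set(te_top)))  # the only shared result is Q280
print(jaccard(ge_top, te_top))
```

A low overlap does not say which embedding is better, only that they capture different notions of similarity; judging quality still requires reading the labels and descriptions as we do here.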

Most similar nodes to Michelob using the **graph embeddings**

In [152]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.926799774169922,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q48799234,0.7637178897857666,similarity,'Virginia Black Whiskey'@en,'super-premium brand of American Bourbon whisk...
2,Q85269976,0.762772262096405,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
3,Q5149389,0.7565429210662842,similarity,'Colt 45'@en,'malt liquor'@en
4,Q3079990,0.752647340297699,similarity,'Four Loko'@en,'Drink'@en
5,Q96952363,0.7438719272613525,similarity,'Cronk'@en,'American drink'@en
6,Q7085533,0.7436875104904175,similarity,'Olde English 800'@en,'malt liquor'@en


Most similar nodes to Michelob using the **text embeddings**

In [149]:
# Q15874936 is Michelob
kgtk_most_similar(te_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q2011473,0.9664472341537476,similarity,'Fantôme'@en,'brand of beer'@en
1,Q3315575,0.9586231708526612,similarity,'Bersalis'@en,'beer brand'@en
2,Q3518554,0.9563601016998292,similarity,'Floris'@en,'beer brand'@en
3,Q15076069,0.9531255960464478,similarity,'Marckloff'@en,'beer brand'@en
4,Q1277388,0.9511646628379822,similarity,'Pripps Blå'@en,'beer brand'@en
5,Q1917255,0.9475076794624328,similarity,'St-Idesbald'@en,'beer'@en
6,Q263980,0.9443504810333252,similarity,'Soproni'@en,'beer mark'@en
7,Q3337782,0.9438232779502868,similarity,'Carrousel'@en,'Beer'@en


The graph embeddings contain some bad results, but the top matches are better as they include beers that are more closely related to Michelob. The text embeddings are reasonable as they include only beers.

Most similar nodes to vodka using the **graph embeddings**

In [153]:
# Q374 is vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q20577688,0.8814862966537476,similarity,'.vodka'@en,'top-level Internet domain'@en
1,Q7468032,0.8503187894821167,similarity,'Vodka'@en,'Detective Conan character'@en
2,Q11328065,0.8384641408920288,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
3,Q21189725,0.8248207569122314,similarity,'Red Eye Louie\\\\'s Vodquila'@en,'blend of vodka and tequila'@en
4,Q2206588,0.8186914920806885,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
5,Q920412,0.8170762062072754,similarity,'Belvédère'@en,'French wine and spirits producer and distribu...
6,Q7151801,0.8166672587394714,similarity,'Category:Vodkas'@en,'Wikimedia category'@en
7,Q23712704,0.8152912855148315,similarity,'EB-11 / Vodka'@en,'encyclopedic article'@en
8,Q1539525,0.8101651668548584,similarity,'Stolichnaya'@en,'vodka brand'@en


Most similar nodes to vodka using the **text embeddings**

In [154]:
# Q374 is vodka
kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q4869283,0.959851622581482,similarity,'Batini'@en,'vodka-based cocktail'@en
1,Q3562046,0.959536910057068,similarity,'Vodka Stinger'@en,'type of cocktail'@en
2,Q2206588,0.943680465221405,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
3,Q22236238,0.9384630918502808,similarity,'Mariette'@en,"'vodka, alcohol'@en"
4,Q7939317,0.9203515648841858,similarity,'Vodka Cruiser'@en,'brand of vodka-based alcoholic drink'@en
5,Q11802565,0.9155371189117432,similarity,'Pan Tadeusz'@en,'brand of vodka'@en
6,Q268057,0.9129104614257812,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4782617,0.9107505679130554,similarity,'Aqua Velva'@en,'vodka and gin based cocktail'@en


The graph embeddings are noisy, as the top matches include nodes not related to vodka; the text embeddings look much better.

In [211]:
# Q27 Ireland
kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q9676,0.8613677024841309,similarity,'Isle of Man'@en,'British Crown dependency'@en
1,Q1263077,0.8335838317871094,similarity,'DAA'@en,'company that owns and operates Dublin Airport...
2,Q4368623,0.8250888586044312,similarity,'Category:Republic of Ireland'@en,'Wikimedia category'@en
3,Q164421,0.8058757781982422,similarity,'Connacht'@en,'province in Ireland'@en
4,Q184760,0.8017445802688599,similarity,'County Monaghan'@en,'county in Ireland'@en
5,Q178283,0.7986090183258057,similarity,'County Limerick'@en,'county in Ireland'@en
6,Q186220,0.7974875569343567,similarity,'County Longford'@en,'county in Ireland'@en
7,Q184594,0.7974545359611511,similarity,'County Waterford'@en,'county in Ireland'@en
8,Q93195,0.793678879737854,similarity,'Ulster'@en,'province in Ireland'@en
9,Q187402,0.788328230381012,similarity,'County Cavan'@en,'county in Ireland'@en


In [210]:
# Q27 Ireland
kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q191,0.7959819436073303,similarity,'Estonia'@en,'sovereign state in Northern Europe'@en
1,Q37,0.7896063327789307,similarity,'Lithuania'@en,'sovereign state in Northeastern Europe'@en
2,Q34,0.7771986722946167,similarity,'Sweden'@en,'sovereign state in Northern Europe'@en
3,Q35,0.7717932462692261,similarity,'Denmark'@en,'sovereign state and Scandinavian country in n...
4,Q756617,0.7578498125076294,similarity,'Kingdom of Denmark'@en,"'sovereign unitary state in Europe, the Arctic..."
5,Q33,0.7564055919647217,similarity,'Finland'@en,'sovereign state in Northern Europe'@en
6,Q16965019,0.7521861791610718,similarity,'North borough of Brescia'@en,'one of 5 boroughs of Brescia'@en
7,Q1526538,0.7520326972007751,similarity,'Reykjavík North'@en,'one of the six constituencies (kjördæmi) of I...
8,Q189,0.7486690282821655,similarity,'Iceland'@en,"'sovereign state in Northern Europe, situated ..."
9,Q22,0.7369431257247925,similarity,'Scotland'@en,"'country in Northwest Europe, part of the Unit..."


### Using the embeddings in queries to the KG

In [164]:
# Q281 whiskey
# Q282 wine
# Q3246609 mixed drink
# Q374 vodka
# Q332378 is absolut

Get the most similar nodes to **absolut**, the Swedish vodka, using the text embeddings and put them in a file.

In [320]:
# Q332378 is absolut
kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q332378.sim.tsv")

In [321]:
result = !head "$TE"/Q332378.sim.tsv
kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q7312560,0.9494208097457886,similarity,'Renat'@en,'Swedish vodka'@en
1,Q406157,0.9068878293037416,similarity,'bäsk'@en,'Swedish style spiced liquor'@en
2,Q1034035,0.8990318775177002,similarity,'Finlandia Vodka'@en,'Finnish brand of vodka'@en
3,Q374,0.8908252716064453,similarity,'vodka'@en,'distilled alcoholic beverage'@en
4,Q2553569,0.8900324106216431,similarity,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en
5,Q2206588,0.8866583108901978,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
6,Q268057,0.8860777616500854,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4021706,0.8785413503646851,similarity,'Xan'@en,'Vodka from Goygol'@en
8,Q4869283,0.8784171342849731,similarity,'Batini'@en,'vodka-based cocktail'@en


Suppose I have absolut vodka and I want to make a cocktail. I can take the list of nodes most similar to absolut and search the KG for mixed drinks (`Q3246609`) that appear in that list.

Here are some drinks we can make with absolut vodka.

In [323]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$TE"/Q332378.sim.tsv -i "$Q154CLAIMS" -i "$Q154LABEL" \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), \
  claims: (n1)-[:P186]->(:Q374), claims: (n1)-[:P186]->(ingredient), label: (ingredient)-[]->(i_label)' \
--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description, \
  ingredient as ingredient, i_label as `ingredient label`' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 20 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,node1;label,node1;description,ingredient,ingredient label
0,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q1105343,'cocktail glass'@en
1,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q1621080,'olive'@en
2,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q26877166,'lemon twist'@en
3,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q26877423,'dry vermouth'@en
4,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q374,'vodka'@en
5,Q2206588,0.8866583108901978,'Caipiroska'@en,'cocktail prepared with vodka'@en,Q374,'vodka'@en
6,Q1966883,0.8709859848022461,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q374,'vodka'@en
7,Q1966883,0.8709859848022461,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q44,'beer'@en
8,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q1105343,'cocktail glass'@en
9,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q3539556,'triple sec'@en


In [291]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$TE"/Q332378.sim.tsv -i "$Q154CLAIMS" \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), claims: (n1)-[:P186]->(:Q374)' \
--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,node1;label,node1;description
0,Q1966883,0.7984070181846619,'Yorsh'@en,'Russian drink of beer and vodka'@en
1,Q2206588,0.7781851291656494,'Caipiroska'@en,'cocktail prepared with vodka'@en
2,Q5580053,0.7759937047958374,'Golden Russian'@en,'cocktail of vodka and Galliano'@en
3,Q2553569,0.7755716443061829,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en
4,Q26883085,0.7711346745491028,'Russian Spring Punch'@en,'sparkling cocktail'@en
5,Q455914,0.7694578170776367,'Vodka Red Bull'@en,'alcoholic beverage'@en
6,Q1723060,0.7578018307685852,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en"
7,Q621302,0.757564902305603,'Appletini'@en,'apple-flavored vodka cocktail'@en
8,Q8032131,0.7451797723770142,'Woo Woo'@en,"'alcoholic beverage made of vodka, peach schna..."
9,Q1507096,0.744042158126831,'Moscow mule'@en,"'mule cocktail with vodka, ginger beer and lim..."


The results are good, with many cocktails to choose from. Note that the embeddings are able to generalize from a specific vodka to vodka in general. The example also illustrates that KGTK can use the results of gensim queries within queries to the KG.
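
The join performed by the Kypher queries above can be pictured in plain Python. The sketch below is only an illustration under stated assumptions: `filter_similar` is a hypothetical helper, the class identifiers `brand_class` and `cocktail_class` are placeholders, and the dictionaries are toy stand-ins for the `isa`, `P279star` and `claims` graphs. The scores are rounded from the tables above.

```python
def filter_similar(similar, instance_of, subclass_star, ingredients,
                   target_class, required):
    """Keep similar nodes that fall under target_class and use `required`."""
    kept = []
    for node, score in similar:
        # Collect every class reachable from the node's direct classes
        classes = set()
        for direct in instance_of.get(node, ()):
            classes |= subclass_star.get(direct, {direct})
        if target_class in classes and required in ingredients.get(node, ()):
            kept.append((node, score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for the similarity file and the KG files (illustrative only)
similar = [("Q7312560", 0.949), ("Q2553569", 0.890), ("Q2206588", 0.887)]
instance_of = {
    "Q7312560": {"brand_class"},      # placeholder class id
    "Q2553569": {"cocktail_class"},   # placeholder class id
    "Q2206588": {"cocktail_class"},
}
subclass_star = {"cocktail_class": {"cocktail_class", "Q3246609"}}
ingredients = {"Q2553569": {"Q374", "Q26877423"}, "Q2206588": {"Q374"}}

drinks = filter_similar(similar, instance_of, subclass_star, ingredients,
                        target_class="Q3246609", required="Q374")
```

The vodka brand is dropped because it is not a mixed drink, while the two cocktails that list vodka (`Q374`) as an ingredient are kept in similarity order.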

In [195]:
# Q332378 is absolut
kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=2000, output_path=os.environ['GE'] + "/Q332378.sim.tsv")

In [199]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q332378.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)


Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q3527971,0.4424980580806732,similarity,'Ti\\\\\\\\'Punch'@en,'cocktail'@en
1,Q594392,0.3889206945896148,similarity,'B-52'@en,"'cocktail of coffee liqueur, Irish cream and t..."
2,Q7535970,0.373583436012268,similarity,'Skittle Bomb'@en,'bomb shot cocktail'@en
3,Q7209010,0.3714387416839599,similarity,'Polar Bear'@en,'mint chocolate cocktail'@en
4,Q3309707,0.3705223202705383,similarity,'Hawaiian Punch'@en,'Fruit punch brand'@en
5,Q12738893,0.3702288269996643,similarity,'Quentão'@en,'Brazilian hot drink made ​​from cachaça and s...
6,Q2935472,0.3678890466690063,similarity,'Campari Soda'@en,'pre-mixed drink made by Campari'@en
7,Q70428,0.3663345277309418,similarity,'Karsk'@en,'Scandinavian cocktail'@en
8,Q590793,0.3614485263824463,similarity,'Vesper'@en,"'cocktail originally made of gin, vodka, and K..."


The results are poor: for the most part, the retrieved cocktails do not contain vodka. Let's try the query with vodka instead of absolut vodka.

In [200]:
# Q374 vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q374.sim.tsv")

In [203]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q374.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q11328065,0.8384641408920288,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
1,Q2206588,0.8186914920806885,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
2,Q3562046,0.6592038869857788,similarity,'Vodka Stinger'@en,'type of cocktail'@en
3,Q1966883,0.5952204465866089,similarity,'Yorsh'@en,'Russian drink of beer and vodka'@en
4,Q5459745,0.5736489295959473,similarity,'flirtini'@en,"'cocktail containing vodka, champagne and pine..."
5,Q455914,0.5721926093101501,similarity,'Vodka Red Bull'@en,'alcoholic beverage'@en
6,Q5103598,0.5712590217590332,similarity,'Chocolate Cake'@en,'cocktail'@en
7,Q26879480,0.5568693280220032,similarity,'Godmother'@en,'cocktail'@en
8,Q5580053,0.5458002090454102,similarity,'Golden Russian'@en,'cocktail of vodka and Galliano'@en
9,Q3900577,0.5457539558410645,similarity,'Pertini'@en,'cocktail drink with honey'@en


The results are good. Somehow, the graph embeddings are able to retrieve the cocktails that have vodka, but cannot generalize from absolut vodka to vodka.

## Produce files to load in the Google Embedding Projector
We need two files:

- a TSV file with the vectors
- a TSV file with the metadata, in the same order as the vectors
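
A minimal sketch of writing the two files in lockstep, so that row *i* of the metadata always describes row *i* of the vectors. The helper name `write_projector_files` and the toy vector values are assumptions made for illustration; the vectors file has no header, while a multi-column metadata file starts with a header row.

```python
import io

def write_projector_files(rows, vectors_out, metadata_out):
    """Write (qnode, label, vector) rows to two aligned TSV streams."""
    metadata_out.write("label\tqnode\n")  # header row, metadata file only
    count = 0
    for qnode, label, vector in rows:
        vectors_out.write("\t".join(str(x) for x in vector) + "\n")
        metadata_out.write("{}\t{}\n".format(label, qnode))
        count += 1
    return count

# Toy rows: the vector values are invented for illustration
rows = [("Q374", "vodka", [0.1, 0.2]), ("Q27", "Ireland", [0.3, 0.4])]
vectors_out = io.StringIO()
metadata_out = io.StringIO()
written = write_projector_files(rows, vectors_out, metadata_out)
```

Writing both streams inside the same loop makes it impossible for the two files to drift out of order.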

We don't want to load all the vectors in the projector because there are too many to visualize. We will load only the following types:

In [89]:
focus_types = {
    "Q3246609": "mixed drink",
    "Q44": "beer",
    "Q282": "wine",
    "Q281": "whiskey",
    "Q374": "vodka",
    "Q6256": "country",
}

Construct a dictionary that maps every q-node in the KG to the set of all its superclasses. We will use this dictionary later to tag each q-node with one of the focus types. For every q-node we will test whether a focus type is in the set of all its superclasses.

In [90]:
classes_result = !$kypher_raw -i "$ISA" -i "$Q154CLAIMS" -i "$TEMP"/Q154.descendant.tsv -i "$P279STAR" \
--match 'isa: (n1)-[]->(c), P279: (c)-[]->(class), claims: ()-[]->(class), descendant: (n1)-[]->()' \
--return 'distinct n1 as qnode, class as class' 

class_dict = {}
for r in classes_result[1:]:
    row = r.split("\t")
    qnode = row[0]
    isa = row[1]
    entry = class_dict.get(qnode)
    if entry is None:
        class_dict[qnode] = set()
        entry = class_dict[qnode]
    entry.add(isa)

In [91]:
class_dict['Q502268']

{'Q102205',
 'Q1048607',
 'Q11024',
 'Q11028',
 'Q11064354',
 'Q111352',
 'Q11435',
 'Q1150070',
 'Q1166770',
 'Q11795009',
 'Q1190554',
 'Q1194058',
 'Q12055130',
 'Q124291',
 'Q12767945',
 'Q131257',
 'Q13878858',
 'Q1400881',
 'Q1422299',
 'Q14819853',
 'Q14912053',
 'Q154',
 'Q15401930',
 'Q1554231',
 'Q1632297',
 'Q16686448',
 'Q16722960',
 'Q167270',
 'Q1681365',
 'Q16887380',
 'Q16889133',
 'Q169336',
 'Q1704572',
 'Q174984',
 'Q1786828',
 'Q1865992',
 'Q187931',
 'Q1914636',
 'Q20817253',
 'Q20937557',
 'Q2095',
 'Q214609',
 'Q2150504',
 'Q2200417',
 'Q22269697',
 'Q22272508',
 'Q22294683',
 'Q22299433',
 'Q22299483',
 'Q223557',
 'Q23009552',
 'Q23009675',
 'Q2424752',
 'Q25481995',
 'Q266328',
 'Q26717101',
 'Q26907166',
 'Q2695280',
 'Q27166344',
 'Q281',
 'Q2844972',
 'Q28555911',
 'Q28728771',
 'Q28732711',
 'Q28823',
 'Q28877',
 'Q28921572',
 'Q2944660',
 'Q29651519',
 'Q2990593',
 'Q2996394',
 'Q31464082',
 'Q3249551',
 'Q337060',
 'Q34394',
 'Q3505845',
 'Q35120',
 'Q35

In [144]:
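def focus_type(qnode):
    # Look up the superclasses once instead of on every iteration.
    classes = class_dict.get(qnode, set())
    for t, name in focus_types.items():
        if t in classes:
            return name
    # Fall back to the set of country q-nodes (country_qnodes).
    if qnode in country_qnodes:
        return "country"
    return "other"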

Construct `country_qnodes`, the set of all country qnodes

In [104]:
country_result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$Q154CLAIMS" \
--match 'claims: (country)-[]->(), isa: (country)-[:isa]->(c), P279: (c)-[]->(:Q6256)' \
--return 'distinct country as country' 

country_qnodes = set()
for r in country_result[1:]:
    country_qnodes.add(r)

Construct `alcoholic_qnodes`, the set of all alcoholic beverage qnodes.

In [105]:
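alcoholic_qnodes = set()
# Use a context manager so the file is closed after reading.
with open(os.environ["TEMP"] + "/Q154.descendant.tsv", "r") as descendant_file:
    for line in descendant_file:
        alcoholic_qnodes.add(line.split("\t")[0])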

In [97]:
def build_embedding_projector_vectors(embeddings_path):
    input_path = embeddings_path + "/embeddings.txt"
    vectors_path = embeddings_path + "/projector.vectors.tsv"
    qnodes_path = embeddings_path + "/projector.qnodes.tsv"

    # Open each file exactly once; the context managers close them on exit.
    with open(input_path, "r") as w2v_file, \
         open(vectors_path, "w") as vectors_file, \
         open(qnodes_path, "w") as qnodes_file:
        qnodes_file.write("node1\n")
        # Skip the first line of the embeddings file.
        next(w2v_file)
        for line in w2v_file:
            items = line.split(" ")
            qnode = items[0]
            # Keep only the alcoholic beverages and the countries.
            if qnode in alcoholic_qnodes or qnode in country_qnodes:
                vectors_file.write("\t".join(items[1:]))
                qnodes_file.write("{}\n".format(qnode))

In [98]:
build_embedding_projector_vectors(os.environ["GE"])

In [99]:
!head "$GE"/projector.qnodes.tsv

node1
Q3242283
Q3866024
Q1112057
Q3866020
Q1513599
Q17329207
Q16620320
Q3895013
Q4880027


In [141]:
def build_embedding_projector_metadata(embeddings_path):
    kg_path = os.environ["OUT"] + "/parts"
    os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
    os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
    os.environ["_qnodes"] = embeddings_path + "/projector.qnodes.tsv"

    #result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    #--match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
    #--return 'distinct n1 as node1, lab as `node1;label`, des as `node1;description`' 
    
    result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    --match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab)' \
    --return 'distinct n1 as node1, lab as `node1;label`'
    
    metadata_path = embeddings_path + "/projector.metadata.tsv"
    metadata_file = open(metadata_path, "w")
    metadata_file.write("tag\tqnode\ttype\n")

    qnode_dict = {}
    for line in result[1:]:
        items = line.split("\t")
        qnode = items[0]
        # qnode_dict[qnode] = "{} ({})".format(items[1], items[2])
        qnode_dict[qnode] = "{}".format(items[1])

    with open(os.environ["_qnodes"]) as qnodes_file:
        next(qnodes_file)
        for line in qnodes_file:
            qnode = line.rstrip("\n")
            ftype = focus_type(qnode)
            # Fall back to the q-node id when the node has no English label.
            label = qnode_dict.get(qnode, qnode)
            tag = "{} ({})".format(label, ftype)
            metadata_file.write("{}\t{}\t{}\n".format(tag, qnode, ftype))

    metadata_file.close()

In [138]:
build_embedding_projector_metadata(os.environ["GE"])

Check that the file sizes are correct: the metadata file has one more line than the vectors file because it has a header row.

In [130]:
!wc "$GE"/projector.metadata.tsv "$GE"/projector.vectors.tsv

    2244   14157  116997 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.metadata.tsv
    2243  224300 2805636 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.vectors.tsv
    4487  238457 2922633 total
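
The same check can be done from Python. The sketch below is a hypothetical helper (`check_projector_files` is not part of KGTK) that counts the rows of both files and confirms that every vector has the same number of dimensions; the in-memory toy data stands in for the real files.

```python
import io

def check_projector_files(vectors_file, metadata_file):
    """Return (vector_rows, metadata_rows, dimensions) for two open TSV files."""
    dimensions = set()
    vector_rows = 0
    for line in vectors_file:
        # split() with no argument ignores a trailing tab before the newline
        dimensions.add(len(line.split()))
        vector_rows += 1
    metadata_rows = sum(1 for _ in metadata_file)
    return vector_rows, metadata_rows, dimensions

# Toy in-memory files, invented for illustration
vectors = io.StringIO("0.1\t0.2\t\n0.3\t0.4\t\n")
metadata = io.StringIO(
    "tag\tqnode\ttype\n"
    "vodka (vodka)\tQ374\tvodka\n"
    "Ireland (country)\tQ27\tcountry\n"
)
rows, meta_rows, dims = check_projector_files(vectors, metadata)
# The metadata file should have exactly one extra line: its header
aligned = meta_rows == rows + 1 and len(dims) == 1
```

With real data, the two `io.StringIO` objects would be replaced by files opened on `projector.vectors.tsv` and `projector.metadata.tsv`.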


In [106]:
!head -1 "$GE"/projector.vectors.tsv

-0.695853055	-0.072303891	0.496231377	-0.293976039	0.193507940	0.096196420	0.043117594	-0.580413938	-0.423150927	0.348393738	-0.044707101	0.447685152	-0.251975268	0.192745760	-0.357472301	0.204551399	-0.013355692	0.216426134	-0.170541272	-0.189649135	-0.299910724	0.295587122	0.594068944	-0.064507566	0.261834234	-0.458304882	-0.426072240	-0.082138501	0.007850863	-0.320901960	0.727239370	0.642546177	-0.339439988	0.260855168	0.066383749	0.018122014	0.614691317	-0.109721325	-0.066969074	-0.123010576	0.231307715	0.633326292	0.570168674	-0.550969541	0.073210679	-0.459269404	0.093307532	0.358197242	0.623394549	-0.309046119	-0.467551976	0.312151939	-0.491982907	0.400699556	-0.383774340	-0.446712554	0.047239214	0.598234832	-0.471011013	-0.039659370	-0.254376531	-0.012475031	-0.207778856	0.335359454	0.302034408	0.153741017	0.902297437	-0.261785030	0.502385259	-0.139487550	0.090193652	-0.114394628	-0.246014833	-0.570263982	0.746979654	0.009215424	-0.472881168	0.205686644	-0.781571090	0.133758202	

In [112]:
build_embedding_projector_vectors(os.environ["TE"])

In [145]:
build_embedding_projector_metadata(os.environ["TE"])

In [143]:
!wc "$TE"/projector.metadata.tsv "$TE"/projector.vectors.tsv "$TE"/projector.qnodes.tsv

    2782   14542  118309 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.metadata.tsv
    2781 2847744 31710917 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.vectors.tsv
    2782    2782   24800 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.qnodes.tsv
    8345 2865068 31854026 total


In [197]:
# Q374 is vodka
kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q374.sim.tsv")

In [198]:
# Q502268 is Johnnie Walker
kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q502268.sim.tsv")

In [199]:
# Q332378 is absolut
kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q332378.sim.tsv")

In [200]:
# Q27 Ireland
kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q27.sim.tsv")

In [201]:
# Q29 Spain
kgtk_most_similar(te_vectors, positive=['Q29'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q29.sim.tsv")

In [202]:
# Q29 Spain, Q45 Portugal, Q142 France
kgtk_most_similar(te_vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + "/parts", topn=2000, output_path=os.environ['TE'] + "/Q29.Q45.Q142.sim.tsv")

In [203]:
# Q33 Finland
kgtk_most_similar(te_vectors, positive=['Q33'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q33.sim.tsv")

In [188]:
# Q502268 is Johnnie Walker
kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q502268.sim.tsv")

In [189]:
# Q374 is vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q374.sim.tsv")

In [190]:
# Q332378 is absolut
kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q332378.sim.tsv")

In [191]:
# Q27 Ireland
kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q27.sim.tsv")

In [192]:
# Q29 Spain
kgtk_most_similar(ge_vectors, positive=['Q29'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q29.sim.tsv")

In [193]:
# Q29 Spain, Q45 Portugal, Q142 France
kgtk_most_similar(ge_vectors, positive=['Q29', 'Q45', 'Q142'], kg_path=os.environ['OUT'] + "/parts", topn=2000, output_path=os.environ['GE'] + "/Q29.Q45.Q142.sim.tsv")

In [211]:
# Q281 whiskey
# Q282 wine
# Q3246609 mixed drink
# Q374 vodka
# Q332378 is absolut
!$kypher -i "$ISA" -i "$P279STAR" -i "$GE"/Q332378.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 \
| column -t -s $'\t'

        0.51 real         0.38 user         0.11 sys
node1     node2                label       node1;label            node1;description
Q3527971  0.4424980580806732   similarity  'Ti\\\\\\\\'Punch'@en  'cocktail'@en
Q594392   0.38892069458961487  similarity  'B-52'@en              'cocktail of coffee liqueur, Irish cream and triple sec'@en


In [41]:
lines = !kgtk remove-columns -i "$Q154LABEL" --all-except --columns node1 node2 
label_dict = {}
for line in lines[1:]:
    items = line.split("\t")
    label_dict[items[0]] = items[1]

In [42]:
lines = !kgtk remove-columns -i "$Q154DESCRIPTION" --all-except --columns node1 node2 
description_dict = {}
for line in lines[1:]:
    items = line.split("\t")
    description_dict[items[0]] = items[1]

In [77]:
def show_labels(similar_list):
    result = []
    for x in similar_list:
        text = "{}, {} ({}), {}".format(label_dict.get(x[0]), description_dict.get(x[0]), x[0], x[1])
        result.append(text)
    return result

# Stop here: the stuff below is Pedro's scratchpad, will be deleted later

### Cleanup

Remove `novalue` and `somevalue`