# Analyzing CSKG

This notebook performs various analyses on CSKG.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill Example8\ -\ Wikidata\ Subset.ipynb example8.out.ipynb \
-p cskg_path /Users/pedroszekely/Downloads/kypher/cskg \
-p kg cskg_connected.tsv.gz \
-p delete_database no 
```

### Parameters for invoking the notebook

- `cskg_path`: a folder containing the CSKG edges file and all the analysis products.
- `kg`: the name of the edge file.
- `delete_database`: whether to delete the SQLite database before running the notebook; an empty string or "no" keeps it, and any other value deletes it.

# Preamble

Set up paths and environment variables

In [5]:
# Parameters
cskg_path = "/Users/pedroszekely/Downloads/kypher/cskg"
kg = "cskg_connected.tsv.gz"
delete_database = "no"

In [6]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt
# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

In [7]:
os.environ['CSKG'] = cskg_path
os.environ['KG'] = "{}/{}".format(cskg_path, kg)
os.environ['NKG'] = "{}/cskg-normalized.tsv.gz".format(cskg_path)
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cskg_path)
os.environ['kypher'] = "time kgtk query --graph-cache " + os.environ['STORE']
# os.environ['kypher'] = "time kgtk --debug query --graph-cache " + os.environ['STORE']

In [8]:
!echo $CSKG
!echo $KG
!echo $kypher
!echo $STORE

/Users/pedroszekely/Downloads/kypher/cskg
/Users/pedroszekely/Downloads/kypher/cskg/cskg_connected.tsv.gz
time kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db
/Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db


In [9]:
cd $cskg_path

/Users/pedroszekely/Downloads/kypher/cskg


In [156]:
if delete_database and delete_database != "no":
    # Remove the cache first, then report it
    !rm $STORE
    print("Deleted database")

# Utilities

In [2]:
def bar_chart(data, x_column, y_column):
    """Construct a simple bar chart with two properties"""
    bars = alt.Chart(data).mark_bar().encode(
        y=alt.Y(y_column, sort='-x'),
        x=x_column
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
        text=x_column
    )

    return (bars + text)

In [3]:
import io
import pandas
import subprocess

def shell_df(command, shell=False, **kwargs):
    """
    Takes a shell command as a string and reads the result into a Pandas DataFrame.
    
    Additional keyword arguments are passed through to pandas.read_csv.
    
    :param command: a shell command that returns tabular data
    :type command: str
    :param shell: passed to subprocess.Popen
    :type shell: bool
    
    :return: a pandas dataframe
    :rtype: :class:`pandas.DataFrame`
    """
    proc = subprocess.Popen(command, 
                            shell=shell,
                            stdout=subprocess.PIPE, 
                            stderr=subprocess.PIPE)
    output, error = proc.communicate()
    
    if proc.returncode == 0:
        if error:
            print(error.decode())
        with io.StringIO(output.decode()) as buffer:
            return pandas.read_csv(buffer, **kwargs)
    else:
        message = ("Shell command returned non-zero exit status: {0}\n\n"
                   "Command was:\n{1}\n\n"
                   "Standard error was:\n{2}")
        raise IOError(message.format(proc.returncode, command, error.decode()))

# Poking around

Print some lines to see what we have

In [10]:
!zcat < "$KG" | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                                              node1                    relation       node2                         node1;label        node2;label             relation;label  relation;dimension  source                                        sentence
/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000       /c/en/0.22_inch_calibre  /r/IsA         /c/en/5.6_millimetres         0.22 inch calibre  5.6 millimetres         is a            CN                  [[0.22 inch calibre]] is [[5.6 millimetres]]
/c/en/0/a/wn-/r/SimilarTo-/c/en/cardinal/a/wn-0000              /c/en/0/a/wn             /r/SimilarTo   /c/en/cardinal/a/wn           0                  cardinal                similar to      CN                  [[0]] is similar to [[cardinal]]
/c/en/0/n/wn/quantity-/r/Synonym-/c/en/zero/n/wn/quantity-0000  /c/en/0/n/wn/quantity    /r/Synonym     /c/en/zero/n/wn/quantity      0                  zero                    synonym 

Normalize the file so that it is easier to process with Kypher

In [7]:
!zcat < "$KG" | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                                              node1                    label          node2                         node1;label        node2;label             label;label  label;dimension  source                                        sentence
/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000       /c/en/0.22_inch_calibre  /r/IsA         /c/en/5.6_millimetres         0.22 inch calibre  5.6 millimetres         is a         CN               [[0.22 inch calibre]] is [[5.6 millimetres]]
/c/en/0/a/wn-/r/SimilarTo-/c/en/cardinal/a/wn-0000              /c/en/0/a/wn             /r/SimilarTo   /c/en/cardinal/a/wn           0                  cardinal                similar to   CN               [[0]] is similar to [[cardinal]]
/c/en/0/n/wn/quantity-/r/Synonym-/c/en/zero/n/wn/quantity-0000  /c/en/0/n/wn/quantity    /r/Synonym     /c/en/zero/n/wn/quantity      0                  zero                    synonym      CN           

Count the number of edges and nodes

In [102]:
!$kypher -i "$KG" \
--match '(n1)-[e]->(n2)' \
--return 'count(e) as num_edges, count(distinct n1) as num_nodes, count(distinct e.relation) as num_relations, count(distinct n2) as num_values' \
| column -t -s $'\t' 

       20.31 real        13.03 user         5.04 sys
num_edges  num_nodes  num_relations  num_values
6003237    1511776    81             1031520
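
To make the four counts concrete, here is a small pure-Python sketch of the same aggregation over a handful of made-up edges. The sample rows and the function name `edge_stats` are illustrative assumptions and are not part of CSKG or KGTK.

```python
def edge_stats(edges):
    """Return the four counts reported above for (node1, relation, node2) triples."""
    edges = list(edges)
    return {
        "num_edges": len(edges),
        # distinct subjects only, mirroring count(distinct n1)
        "num_nodes": len({n1 for n1, _, _ in edges}),
        "num_relations": len({rel for _, rel, _ in edges}),
        # distinct objects, mirroring count(distinct n2)
        "num_values": len({n2 for _, _, n2 in edges}),
    }


# Made-up sample edges, for illustration only.
sample_edges = [
    ("/c/en/belt", "/r/IsA", "/c/en/accessory"),
    ("/c/en/belt", "/r/UsedFor", "/c/en/hold_up_pants"),
    ("/c/en/flute", "/r/IsA", "/c/en/instrument"),
]
stats = edge_stats(sample_edges)
print(stats)
```

Note that `num_nodes` counts only nodes that appear as `node1`, so a node that occurs solely as an object is counted under `num_values` instead.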


# Some Statistics

In [9]:
!$kypher -i "$KG" \
--match '(n1)-[e]->(n2)' \
--return 'distinct e.relation, count(distinct n1) as nodes' \
--order-by 'count(distinct n1) desc' \
| column -t -s $'\t' 

       35.79 real         8.40 user         6.64 sys
relation                      nodes
/r/RelatedTo                  554822
/r/FormOf                     376992
/r/DerivedFrom                262822
/r/IsA                        236604
/r/Synonym                    229295
/r/HasContext                 182829
/r/Antonym                    37990
/r/PartOf                     26890
at:xReact                     24312
at:xAttr                      24312
at:xWant                      24158
at:xEffect                    23255
at:xNeed                      22146
/r/EtymologicallyRelatedTo    21667
at:xIntent                    21371
/r/SimilarTo                  15834
at:oWant                      14669
at:oReact                     14070
/r/CapableOf                  10907
at:oEffect                    10895
/r/AtLocation                 9958
/r/MannerOf                   9896
/r/HasProperty                6946
/r/UsedFor                    5948
/r/LocatedNear                5728
/r/Distinc

In [51]:
command = "$kypher -i $KG \
--match '(n1)-[e]->(n2)' \
--return 'distinct e.relation, count(distinct n1) as nodes' \
--order-by 'count(distinct n1) desc'"
data = shell_df(command, shell=True, sep='\t')

        9.96 real         8.46 user         1.39 sys



In [53]:
bar_chart(data, 'nodes', 'relation')

# Clustering

First, find pairs of nodes `n1` and `n2` that share a common label. Testing `n1 < n2` keeps each unordered pair once instead of emitting both `(n1, n2)` and `(n2, n1)`, and it also drops self-pairs. Testing `n1 <= n2` instead would add the reflexive relation (every node paired with itself), but that makes the file much larger and the subsequent commands take a very long time.
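
The effect of the ordering test can be sketched in plain Python. This is only an illustration on made-up `(node, label)` rows; the helper name `same_label_pairs` is hypothetical and the code is not how Kypher evaluates the query.

```python
from collections import defaultdict
from itertools import combinations


def same_label_pairs(node_labels):
    """Yield (n1, n2, label) for nodes that share a label, with n1 < n2."""
    by_label = defaultdict(set)
    for node, label in node_labels:
        by_label[label].add(node)
    for label in sorted(by_label):
        # combinations() over a sorted list yields each unordered pair once,
        # always with the smaller node first, and never pairs a node with itself.
        for n1, n2 in combinations(sorted(by_label[label]), 2):
            yield n1, n2, label


# Made-up sample rows, for illustration only.
sample = [
    ("/c/en/belt", "belt"),
    ("/c/en/belt/n", "belt"),
    ("Q134560", "belt"),
    ("/c/en/flute", "flute"),
]
pairs = list(same_label_pairs(sample))
for pair in pairs:
    print(pair)
```

Three nodes sharing one label give 3 pairs under `n1 < n2`, 6 under `n1 <= n2`, and 9 with no test at all, which is why the stricter test keeps the output manageable.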

### Build the clusters

In [12]:
!$kypher -i "$KG" \
--match '(n1 {label: label})-[]->(), (n2 {label: label})-[]->()' \
--where 'n1 < n2' \
--return 'distinct n1 as node_x, n2 as node_y, "same_name" as relation, label as common_label' \
--order-by 'label' \
-o $CSKG/same_name.tsv.gz

     1160.11 real      1119.48 user        25.53 sys


In [13]:
!zcat < $CSKG/same_name.tsv.gz | wc -l

 1317833


Rename the columns and add ids so that we can use the file in KGTK.

In [14]:
!kgtk rename-columns --mode NONE -i $CSKG/same_name.tsv.gz --output-columns node1 node2 relation common_label \
/ add-id --id-style node1-label-node2 -o $CSKG/same_name_edges.tsv.gz

Let's see what we got

In [15]:
!zcat < $CSKG/same_name_edges.tsv.gz | head

node1	node2	relation	common_label	id
fn:fe:abundant_entities	fn:fe:abuser	same_name		fn:fe:abundant_entities-same_name-fn:fe:abuser
fn:fe:abundant_entities	fn:fe:accessibility	same_name		fn:fe:abundant_entities-same_name-fn:fe:accessibility
fn:fe:abundant_entities	fn:fe:accoutrement	same_name		fn:fe:abundant_entities-same_name-fn:fe:accoutrement
fn:fe:abundant_entities	fn:fe:accuracy	same_name		fn:fe:abundant_entities-same_name-fn:fe:accuracy
fn:fe:abundant_entities	fn:fe:accused	same_name		fn:fe:abundant_entities-same_name-fn:fe:accused
fn:fe:abundant_entities	fn:fe:act	same_name		fn:fe:abundant_entities-same_name-fn:fe:act
fn:fe:abundant_entities	fn:fe:action	same_name		fn:fe:abundant_entities-same_name-fn:fe:action
fn:fe:abundant_entities	fn:fe:activists	same_name		fn:fe:abundant_entities-same_name-fn:fe:activists
fn:fe:abundant_entities	fn:fe:activity	same_name		fn:fe:abundant_entities-same_name-fn:fe:activity
zcat: error writing to output: Broken pipe


In [16]:
!kgtk cat --every-nth-record 10000 --initial-skip-count 1000000 -i $CSKG/same_name_edges.tsv.gz | head | column -t -s $'\t' 

node1                         node2                      relation   common_label     id
/c/en/pentobarbital           /c/en/pentobarbital/n      same_name  pentobarbital    /c/en/pentobarbital-same_name-/c/en/pentobarbital/n
/c/en/piet                    /c/en/piet/n               same_name  piet             /c/en/piet-same_name-/c/en/piet/n
/c/en/plymouth_county         /c/en/plymouth_county/n    same_name  plymouth county  /c/en/plymouth_county-same_name-/c/en/plymouth_county/n
/c/en/postracial              /c/en/postracial/a         same_name  postracial       /c/en/postracial-same_name-/c/en/postracial/a
/c/en/printout                /c/en/printout/n           same_name  printout         /c/en/printout-same_name-/c/en/printout/n
/c/en/pug/n/wn/animal         /c/en/pug/v/wikt/en_4      same_name  pug              /c/en/pug/n/wn/animal-same_name-/c/en/pug/v/wikt/en_4
/c/en/raddle                  /c/en/raddle/v/wn/contact  same_name  raddle           /c/en/raddle-same_name-/c/en/radd

Let's form clusters of all `node1` values that share a common label. We make the common label the identifier of the cluster and add the nodes as its members.

In [17]:
!$kypher -i $CSKG/same_name_edges.tsv.gz \
--match '(n1)-[l {common_label: common}]->()' \
--where 'common != ""' \
--return 'common as node_x, "cluster_member" as relation, n1 as node_y' \
--order-by 'common' \
-o $CSKG/temp.cluster.node1.tsv.gz 

       27.34 real        25.69 user         4.16 sys


Do the same with `node2` so that they are also members of the clusters.

In [18]:
!$kypher -i $CSKG/same_name_edges.tsv.gz \
--match '()-[l {common_label: common}]->(n2)' \
--where 'common != ""' \
--return 'common as node_x, "cluster_member" as relation, n2 as node_y' \
--order-by 'common' \
-o $CSKG/temp.cluster.node2.tsv.gz 

        5.66 real         5.29 user         0.30 sys
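
Taken together, the two queries put both ends of every `same_name` edge into the cluster named by the shared label. A minimal Python sketch of that idea follows; the sample edges and the helper name `build_clusters` are assumptions made for illustration.

```python
from collections import defaultdict


def build_clusters(same_name_edges):
    """Map each non-empty common label to the set of nodes on either end of an edge."""
    clusters = defaultdict(set)
    for node1, node2, common_label in same_name_edges:
        if common_label == "":
            continue  # mirrors the `common != ""` filter in the queries
        clusters[common_label].add(node1)
        clusters[common_label].add(node2)
    return dict(clusters)


# Made-up sample edges (node1, node2, common_label), for illustration only.
sample_edges = [
    ("/c/en/belt", "/c/en/belt/n", "belt"),
    ("/c/en/belt", "/c/en/belt/n/wn/artifact", "belt"),
    ("fn:fe:abundant_entities", "fn:fe:abuser", ""),
]
clusters = build_clusters(sample_edges)
print(clusters)
```

Using a set per label removes repeated members as a side effect, whereas the file-based pipeline in this notebook emits one row per edge and therefore needs a separate deduplication step.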


In [19]:
!zcat < $CSKG/temp.cluster.node1.tsv.gz | wc
!zcat < $CSKG/temp.cluster.node2.tsv.gz | wc
!zcat < $CSKG/temp.cluster.node1.tsv.gz | head | column -t -s $'\t' 
!zcat < $CSKG/temp.cluster.node2.tsv.gz | head | column -t -s $'\t' 

  853667 2713904 37415009
  853667 2713908 38917319
zcat: error writing to output: Broken pipe
node_x              relation        node_y
"meerkats" tv show  cluster_member  /c/en/meerkat/n/wn/animal
'zeros - europe'    cluster_member  /c/en/europe/n/wn/group
0                   cluster_member  /c/en/0/a/wn
0                   cluster_member  /c/en/0/a/wn
0                   cluster_member  /c/en/0/a/wn
0                   cluster_member  /c/en/0/a/wn
0                   cluster_member  /c/en/0/n/wn/quantity
0                   cluster_member  /c/en/0/n/wn/quantity
0                   cluster_member  /c/en/0/n/wp/number
zcat: error writing to output: Broken pipe
node_x              relation        node_y
"meerkats" tv show  cluster_member  /c/en/television_program/n/wn/communication
'zeros - europe'    cluster_member  /c/en/nothing/n/wn/quantity
0                   cluster_member  /c/en/0/n
0                   cluster_member  /c/en/0/n/wn/quantity
0                   cluster_member  /c/en/0/n/wp/num

We had to use `node_x` and `node_y` as the names of the columns because kypher refused to output them as `node1` and `node2`. Now we have to rename them.

In [20]:
!kgtk rename-columns --mode NONE --output-columns node1 label node2 -i $CSKG/temp.cluster.node1.tsv.gz -o $CSKG/temp.cluster.node1.renamed.tsv.gz

In [21]:
!kgtk rename-columns --mode NONE --output-columns node1 label node2 -i $CSKG/temp.cluster.node2.tsv.gz -o $CSKG/temp.cluster.node2.renamed.tsv.gz

Now concatenate the two cluster files and add ids built from `node1`, `label`, and `node2` so that we can deduplicate later.

In [22]:
!kgtk cat -i $CSKG/temp.cluster.node1.renamed.tsv.gz -i $CSKG/temp.cluster.node2.renamed.tsv.gz \
/ add-id --id-style node1-label-node2 \
/ sort2 \
-o $CSKG/temp.name.clusters.1.tsv.gz

In [23]:
!zcat < $CSKG/temp.name.clusters.1.tsv.gz | head -10 | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
node1               label           node2                                        id
"meerkats" tv show  cluster_member  /c/en/meerkat/n/wn/animal                    "meerkats" tv show-cluster_member-/c/en/meerkat/n/wn/animal
"meerkats" tv show  cluster_member  /c/en/television_program/n/wn/communication  "meerkats" tv show-cluster_member-/c/en/television_program/n/wn/communication
'zeros - europe'    cluster_member  /c/en/europe/n/wn/group                      'zeros - europe'-cluster_member-/c/en/europe/n/wn/group
'zeros - europe'    cluster_member  /c/en/nothing/n/wn/quantity                  'zeros - europe'-cluster_member-/c/en/nothing/n/wn/quantity
0 100               cluster_member  /c/en/0_100                                  0 100-cluster_member-/c/en/0_100
0 100               cluster_member  /c/en/0_100/n                                0 100-cluster_member-/c/en/0_100/n
0 4 0 0 4 0         cluster_member  /c/en/0_4_0_0_4_0            

We have lots of duplicates, so let's get rid of them using the `compact` command. (The `--presorted` flag does not work, even though the file was the output of `sort2`.)

In [24]:
!kgtk compact -i $CSKG/temp.name.clusters.1.tsv.gz -o $CSKG/temp.name.clusters.2.tsv.gz
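
For this file, the visible effect of adding `node1-label-node2` ids and then compacting is that repeated rows collapse into one. The sketch below imitates that outcome in plain Python on made-up rows; `add_ids_and_dedupe` is a hypothetical helper, and the code makes no claim about how `kgtk compact` works internally.

```python
def add_ids_and_dedupe(rows):
    """Attach node1-label-node2 ids to (node1, label, node2) rows, keeping one row per id."""
    unique = {}
    for node1, label, node2 in rows:
        edge_id = f"{node1}-{label}-{node2}"
        # setdefault keeps the first row seen for each id and ignores repeats
        unique.setdefault(edge_id, (node1, label, node2, edge_id))
    return sorted(unique.values())


# Made-up sample rows, for illustration only.
rows = [
    ("0", "cluster_member", "/c/en/0/a/wn"),
    ("0", "cluster_member", "/c/en/0/a/wn"),
    ("0", "cluster_member", "/c/en/0/n/wn/quantity"),
]
deduped = add_ids_and_dedupe(rows)
for row in deduped:
    print("\t".join(row))
```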

In [25]:
!zcat < $CSKG/temp.name.clusters.2.tsv.gz | head -10 | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
node1               label           node2                                        id
"meerkats" tv show  cluster_member  /c/en/meerkat/n/wn/animal                    "meerkats" tv show-cluster_member-/c/en/meerkat/n/wn/animal
"meerkats" tv show  cluster_member  /c/en/television_program/n/wn/communication  "meerkats" tv show-cluster_member-/c/en/television_program/n/wn/communication
'zeros - europe'    cluster_member  /c/en/europe/n/wn/group                      'zeros - europe'-cluster_member-/c/en/europe/n/wn/group
'zeros - europe'    cluster_member  /c/en/nothing/n/wn/quantity                  'zeros - europe'-cluster_member-/c/en/nothing/n/wn/quantity
0                   cluster_member  /c/en/0                                      0-cluster_member-/c/en/0
0                   cluster_member  /c/en/0/a/wn                                 0-cluster_member-/c/en/0/a/wn
0                   cluster_member  /c/en/0/n                                    0-cluster_m

For fun, let's look at the cluster for `belt`.

In [26]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \
--match '(cluster:`belt`)-[l]->(n2)' \
--limit 20 \
| column -t -s $'\t'

       14.09 real        11.78 user         2.99 sys
node1  label           node2                                 id
belt   cluster_member  /c/en/belt                            belt-cluster_member-/c/en/belt
belt   cluster_member  /c/en/belt/n                          belt-cluster_member-/c/en/belt/n
belt   cluster_member  /c/en/belt/n/opencyc/belt_clothing    belt-cluster_member-/c/en/belt/n/opencyc/belt_clothing
belt   cluster_member  /c/en/belt/n/opencyc/belt_mechanical  belt-cluster_member-/c/en/belt/n/opencyc/belt_mechanical
belt   cluster_member  /c/en/belt/n/opencyc/belt_region      belt-cluster_member-/c/en/belt/n/opencyc/belt_region
belt   cluster_member  /c/en/belt/n/wn/act                   belt-cluster_member-/c/en/belt/n/wn/act
belt   cluster_member  /c/en/belt/n/wn/artifact              belt-cluster_member-/c/en/belt/n/wn/artifact
belt   cluster_member  /c/en/belt/n/wn/event                 belt-cluster_member-/c/en/belt/n/wn/event
belt   cluster_member  /c/en/belt/n/wn/

In [27]:
!wd u Q134560 Q623755

id Q134560
Label belt
Description worn band or braid, usually around the waist or hips
subclass of (P279): costume accessory (Q1065579)

id Q623755
Label belt
Description loop of flexible material used to mechanically link rotating shafts
subclass of (P279): device (Q1183543)


### Look at popular clusters

In [70]:
command = "$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \
--match '(cluster)-[l]-(member)' \
--return 'distinct cluster as node, count(distinct member) as count' \
--order-by 'count(distinct member) desc' \
--limit 50" 
data = shell_df(command, shell=True, sep='\t')

        1.86 real         1.52 user         0.30 sys



In [71]:
bar_chart(data, 'count', 'node')
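
The ranking behind this chart is a count of distinct members per cluster, sorted in descending order and truncated. A rough Python equivalent on made-up `(cluster, member)` rows is sketched below; `top_clusters` is a hypothetical helper, and the tie-breaking by cluster name is an added assumption that the query does not specify.

```python
from collections import defaultdict


def top_clusters(cluster_edges, limit=50):
    """Return up to `limit` (cluster, distinct_member_count) pairs, largest first."""
    members = defaultdict(set)
    for cluster, member in cluster_edges:
        members[cluster].add(member)
    counts = [(cluster, len(nodes)) for cluster, nodes in members.items()]
    # Largest clusters first; ties broken by name (an assumption for determinism)
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:limit]


# Made-up sample rows, for illustration only.
edges = [
    ("belt", "/c/en/belt"),
    ("belt", "/c/en/belt/n"),
    ("belt", "/c/en/belt"),
    ("flute", "/c/en/flute"),
]
top = top_clusters(edges, limit=2)
print(top)
```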

In [29]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \
--match '(cluster:`news in brief`)-[l]->(n2)' \
--limit 10 \
| column -t -s $'\t'

        0.83 real         0.66 user         0.15 sys
node1          label           node2      id
news in brief  cluster_member  Q56836285  news in brief-cluster_member-Q56836285
news in brief  cluster_member  Q58965155  news in brief-cluster_member-Q58965155
news in brief  cluster_member  Q58965282  news in brief-cluster_member-Q58965282
news in brief  cluster_member  Q58965656  news in brief-cluster_member-Q58965656
news in brief  cluster_member  Q58965794  news in brief-cluster_member-Q58965794
news in brief  cluster_member  Q58965916  news in brief-cluster_member-Q58965916
news in brief  cluster_member  Q58966165  news in brief-cluster_member-Q58966165
news in brief  cluster_member  Q58979818  news in brief-cluster_member-Q58979818
news in brief  cluster_member  Q58979822  news in brief-cluster_member-Q58979822
news in brief  cluster_member  Q58980098  news in brief-cluster_member-Q58980098


Oh, we don't want this one.

In [30]:
!wd u Q56836285 Q58965155 Q58965282

id Q56836285
Label news in brief
Description scientific article published in Nature
instance of (P31): scholarly article (Q13442814)

id Q58965155
Label news in brief
Description article publié dans la revue scientifique Nature
instance of (P31): scholarly article (Q13442814)

id Q58965282
Label news in brief
Description article publié dans la revue scientifique Nature
instance of (P31): scholarly article (Q13442814)


In [31]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \
--match '(cluster:`flute`)-[l]->(n2)' \
--limit 20 \
| column -t -s $'\t'

        0.84 real         0.68 user         0.14 sys
node1  label           node2                      id
flute  cluster_member  /c/en/flute                flute-cluster_member-/c/en/flute
flute  cluster_member  /c/en/flute/n              flute-cluster_member-/c/en/flute/n
flute  cluster_member  /c/en/flute/n/wikt/en_1    flute-cluster_member-/c/en/flute/n/wikt/en_1
flute  cluster_member  /c/en/flute/n/wikt/en_2    flute-cluster_member-/c/en/flute/n/wikt/en_2
flute  cluster_member  /c/en/flute/n/wn/artifact  flute-cluster_member-/c/en/flute/n/wn/artifact
flute  cluster_member  /c/en/flute/v/wikt/en_1    flute-cluster_member-/c/en/flute/v/wikt/en_1
flute  cluster_member  /c/en/flute/v/wn/contact   flute-cluster_member-/c/en/flute/v/wn/contact
flute  cluster_member  Q89192698                  flute-cluster_member-Q89192698
flute  cluster_member  Q89192704                  flute-cluster_member-Q89192704
flute  cluster_member  Q89192707                  flute-cluster_member-Q89192707
flute

Hmm, those specific flutes probably don't belong in CSKG.

In [32]:
!wd u Q89192698 Q89192704 Q89192707

id Q89192698
Label flute
Description Flute, Johann Georg Braun, Mannheim, 1816–1833
instance of (P31): flute (Q5462939) | flute (Q11405)

id Q89192704
Label flute
Description Flute, Cortellini, Turin, second quarter of 19th century
instance of (P31): flute (Q5462939) | flute (Q11405)

id Q89192707
Label flute
Description Flute, Cornelius Ward, London, c. 1842
instance of (P31): flute (Q5462939) | flute (Q11405)


In [33]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \
--match '(cluster:`break`)-[l]->(n2)' \
--limit 100 \
| column -t -s $'\t'

        0.87 real         0.68 user         0.16 sys
node1  label           node2                              id
break  cluster_member  /c/en/absconder/n/wn/person        break-cluster_member-/c/en/absconder/n/wn/person
break  cluster_member  /c/en/american_civil_war/n/wn/act  break-cluster_member-/c/en/american_civil_war/n/wn/act
break  cluster_member  /c/en/break                        break-cluster_member-/c/en/break
break  cluster_member  /c/en/break/n/wikt/en_1            break-cluster_member-/c/en/break/n/wikt/en_1
break  cluster_member  /c/en/break/n/wikt/en_2            break-cluster_member-/c/en/break/n/wikt/en_2
break  cluster_member  /c/en/break/n/wn/geology           break-cluster_member-/c/en/break/n/wn/geology
break  cluster_member  /c/en/break/n/wn/state             break-cluster_member-/c/en/break/n/wn/state
break  cluster_member  /c/en/break/n/wn/tennis            break-cluster_member-/c/en/break/n/wn/tennis
break  cluster_member  /c/en/break/n/wn/time              br

In [34]:
!wd u Q1681122 Q2707973 Q55398038

id Q1681122
Label break
Description period of time during a shift in which an employee is allowed to take time off
subclass of (P279): time interval (Q186081)

id Q2707973
Label break
Description tennis
instance of (P31): sports terminology (Q28829877)

id Q55398038
Label break
Description in cue sports



# Relations among clusters

In [35]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \
--match 'clusters: (cluster:`ice cream`)-[l]->(n2), cskg: (n2)-[rid]->(object)' \
--return 'distinct rid.relation as relation, object as value' \
--order-by 'rid.relation' \
--limit 10 \
| column -t -s $'\t'

        1.71 real         1.32 user         0.32 sys
relation       value
/r/AtLocation  /c/en/disneyland
/r/AtLocation  /c/en/freezer
/r/AtLocation  /c/en/movie
/r/AtLocation  /c/en/party
/r/CapableOf   /c/en/delight_child
/r/CapableOf   /c/en/melt
/r/CapableOf   /c/en/taste_sweet
/r/CapableOf   /c/en/earth_science/n/wn/cognition
/r/CapableOf   /c/en/melt/v/wn/change
/r/CapableOf   /c/en/scoop/v/wn/contact


In [37]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \
--match 'clusters: (cluster)-[l]->(n2), cskg: (n2)-[rid]->(object), clusters: (word)-[]->(object)' \
--return 'distinct cluster as subject, rid.relation as relation, word as value, rid.source as source' \
--order-by 'cluster, rid.relation, rid.source, word' \
-o $CSKG/relations.tsv.gz

      151.07 real       109.45 user        10.77 sys
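
The query above is a three-way join: cluster to member node, member node to object node through a CSKG edge, and object node back to the clusters that contain it. The following Python sketch shows the same lifting of node-level edges to label-level rows on made-up data; `cluster_relations` is a hypothetical helper, not part of KGTK.

```python
from collections import defaultdict


def cluster_relations(cluster_edges, kg_edges):
    """Lift node-level edges to (subject, relation, value, source) rows over cluster labels."""
    labels_of = defaultdict(set)  # node -> labels of the clusters it belongs to
    for cluster, member in cluster_edges:
        labels_of[member].add(cluster)
    rows = set()  # a set gives the `distinct` behaviour
    for node1, relation, node2, source in kg_edges:
        for subject in labels_of.get(node1, ()):
            for value in labels_of.get(node2, ()):
                rows.add((subject, relation, value, source))
    # Same ordering as the query: subject, relation, source, value
    return sorted(rows, key=lambda r: (r[0], r[1], r[3], r[2]))


# Made-up sample data, for illustration only.
cluster_edges = [
    ("ice cream", "/c/en/ice_cream"),
    ("ice cream", "/c/en/ice_cream/n"),
    ("freezer", "/c/en/freezer"),
]
kg_edges = [
    ("/c/en/ice_cream", "/r/AtLocation", "/c/en/freezer", "CN"),
    ("/c/en/ice_cream/n", "/r/AtLocation", "/c/en/freezer", "CN"),
    ("/c/en/ice_cream", "/r/AtLocation", "/c/en/party", "CN"),
]
relations = cluster_relations(cluster_edges, kg_edges)
print(relations)
```

In this sketch an edge whose object belongs to no cluster (here `/c/en/party`) produces no row, because the final join step has nothing to match.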


In [38]:
!zcat < $CSKG/relations.tsv.gz | wc

 10931565 49980020 424067168


In [39]:
!zcat < $CSKG/relations.tsv.gz | head | column -t -s $'\t'

zcat: error writing to output: Broken pipe
subject             relation        value               source
"meerkats" tv show  /r/CapableOf    glance              VG
"meerkats" tv show  /r/IsA          "meerkats" tv show  CN|WN
"meerkats" tv show  /r/IsA          broadcast           CN|WN
"meerkats" tv show  /r/IsA          meerkat             CN|WN
"meerkats" tv show  /r/IsA          network             CN|WN
"meerkats" tv show  /r/IsA          slightly            CN|WN
"meerkats" tv show  /r/LocatedNear  television          VG
"meerkats" tv show  /r/LocatedNear  tv                  VG
"meerkats" tv show  /r/PartOf       "meerkats" tv show  WN


In [63]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \
--match 'clusters: (cluster)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \
--return 'distinct rid.source as source, cluster as subject, rel_label as `relation id`, word as value, rel_label.label as relation' \
--order-by 'cluster, rid.relation, rid.source, word' \
-o $CSKG/relations-detailed.tsv.gz

      166.52 real       147.83 user         7.20 sys


In [64]:
!zcat < $CSKG/relations-detailed.tsv.gz | wc

 12626473 102940980 903658537


In [65]:
!zcat < $CSKG/relations-detailed.tsv.gz | head | column -t -s $'\t'

zcat: error writing to output: Broken pipe
source  subject             relation id     value               relation
VG      "meerkats" tv show  /r/CapableOf    glance              capable of
CN|WN   "meerkats" tv show  /r/IsA          "meerkats" tv show  is a
CN|WN   "meerkats" tv show  /r/IsA          broadcast           is a
CN|WN   "meerkats" tv show  /r/IsA          meerkat             is a
CN|WN   "meerkats" tv show  /r/IsA          network             is a
CN|WN   "meerkats" tv show  /r/IsA          slightly            is a
VG      "meerkats" tv show  /r/LocatedNear  television          on front of|playing on|written on
VG      "meerkats" tv show  /r/LocatedNear  tv                  on front of|playing on|written on
WN      "meerkats" tv show  /r/PartOf       "meerkats" tv show  is a part of


In [60]:
!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \
--match 'clusters: (cluster:`teddy bear`)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \
--return 'distinct rid.source as source, cluster as subject, rel_label as `relation id`, word as value, rel_label.label as relation' \
--order-by 'cluster, rid.relation, rid.source, word' \
--limit 250 \
| column -t -s $'\t'

        1.13 real         0.74 user         0.20 sys
source  subject     relation id     value                                   relation
CN      teddy bear  /r/AtLocation   bed                                     at location
CN      teddy bear  /r/AtLocation   home                                    at location
CN      teddy bear  /r/AtLocation   shelf                                   at location
VG      teddy bear  /r/CapableOf    alive                                   capable of
VG      teddy bear  /r/CapableOf    appear                                  capable of
VG      teddy bear  /r/CapableOf    asleep                                  capable of
VG      teddy bear  /r/CapableOf    bat                                     capable of
VG      teddy bear  /r/CapableOf    be                                      capable of
VG      teddy bear  /r/CapableOf    beamish                                 capable of
VG      teddy bear  /r/CapableOf    bear                                    

In [59]:
command = "$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \
--match 'clusters: (cluster:`teddy bear`)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \
--return 'distinct rid.source as source, cluster as subject, rel_label as relation, count(rel_label) as count' \
--order-by 'cluster, count(rel_label) desc, rid.relation, rid.source' \
--limit 250" 
data = shell_df(command, shell=True, sep='\t')
bar_chart(data, 'count', 'relation')  # the query returns a `count` column, not `nodes`

        0.93 real         0.74 user         0.16 sys
source  subject     relation            count
VG      teddy bear  /r/LocatedNear      1990
VG      teddy bear  mw:MayHaveProperty  240
VG      teddy bear  /r/CapableOf        117
CN      teddy bear  /r/Synonym          7
CN|WN   teddy bear  /r/IsA              6
CN      teddy bear  /r/AtLocation       3
CN      teddy bear  /r/IsA              2
CN      teddy bear  /r/RelatedTo        2
WD      teddy bear  /r/HasContext       1
WD      teddy bear  /r/IsA              1
CN      teddy bear  /r/UsedFor          1


In [44]:
!kgtk sort2 --help

usage: kgtk sort2 [-h] [-i INPUT] [-o OUTPUT_FILE]
                  [-c [COLUMNS [COLUMNS ...]]] [--locale LOCALE]
                  [-r [True|False]] [--pure-python [True|False]] [-X EXTRA]
                  [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input-file INPUT
                        Input file to sort. (May be omitted or '-' for stdin.)
  -o OUTPUT_FILE, --out OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output file to write to. (May be omitted or '-' for
                        stdout.)
  -c [COLUMNS [COLUMNS ...]], --column [COLUMNS [COLUMNS ...]], --columns [COLUMNS [COLUMNS ...]]
                        comma-separated list of column names to sort on.
                        (defaults to id for node files, (node1, label, node2)
                        for edge files without ID, (id, node1, label, node2)
                        for edge files with ID)
  --locale LOCALE       LC_ALL locale controls the s

In [43]:
!kgtk --help

usage: kgtk [options] command [ / command]*

kgtk --- Knowledge Graph Toolkit

positional arguments:
  command
    add-id              Copy a KGTK file, adding ID values.
    calc                Perform calculations on KGTK file columns.
    cat                 Concatenate KGTK files.
    clean-data          Validate a KGTK file and output a clean copy: no
                        comments, whitespace lines, invalid lines, etc.
    compact             Copy a KGTK file compacting | lists.
    connected-components
                        Find connected components in a Graph.
    expand              Copy a KGTK file expanding | lists.
    explode (denormalize_node2)
                        Copy a KGTK file, exploding one column (usualy node2)
                        into seperate columns for each subfield.
    export-gt           Export a KGTK file to Graph-tool format.
    export-neo4j        Exports data to Neo4J Cypher Query Language
                        statements.
    export-wikida