# Analyzing CSKG

This notebook performs various analyses on CSKG.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill Example8\ -\ Wikidata\ Subset.ipynb example8.out.ipynb \
-p cskg_path /Users/pedroszekely/Downloads/kypher/cskg \
-p kg cskg_connected.tsv.gz \
-p delete_database no 
```

### Parameters for invoking the notebook

- `cskg_path`: a folder containing the CSKG edges file and all the analysis products.
- `kg`: the name of the edge file.
- `delete_database`: whether to delete the SQL database before running the notebook: "" or "no" means don't delete it.
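
The rule for `delete_database` can be written as a small predicate. This is a minimal sketch: the helper name `should_delete_database` is hypothetical and does not appear in the notebook, but it mirrors the rule stated in the list above.

```python
def should_delete_database(value):
    """Return True unless the parameter is empty or the string "no"."""
    # "" and "no" both mean: keep the existing SQL database.
    return bool(value) and value != "no"


for value in ["", "no", "yes"]:
    print(repr(value), "->", should_delete_database(value))
```

The comparison is case-sensitive, so a value such as `"No"` would count as a request to delete.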

# Preamble

Set up paths and environment variables

In [None]:
# Parameters
cskg_path = "/Users/pedroszekely/Downloads/kypher/cskg"
kg = "cskg_connected.kgtk.gz"
delete_database = "yes"

In [2]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

In [3]:
os.environ['CSKG'] = cskg_path
os.environ['KG'] = "{}/{}".format(cskg_path, kg)
os.environ['NKG'] = "{}/cskg-normalized.kgtk.gz".format(cskg_path)
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cskg_path)
os.environ['kypher'] = "time kgtk query --graph-cache " + os.environ['STORE']
# os.environ['kypher'] = "time kgtk --debug query --graph-cache " + os.environ['STORE']

In [4]:
!echo $CSKG
!echo $KG
!echo $NKG
!echo $kypher
!echo $STORE

/Users/pedroszekely/Downloads/kypher/cskg
/Users/pedroszekely/Downloads/kypher/cskg/cskg_connected.kgtk.gz
/Users/pedroszekely/Downloads/kypher/cskg/cskg-normalized.kgtk.gz
time kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db
/Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db


In [5]:
cd $cskg_path

/Users/pedroszekely/Downloads/kypher/cskg


In [6]:
if delete_database and delete_database != "no":
    # -f keeps rm quiet when the database does not exist yet (e.g. on a first run)
    !rm -f $STORE
    print("Deleted database")

Deleted database
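
A pure-Python alternative to shelling out to `rm` is sketched below. It also reports whether a file was actually removed. The helper `remove_if_exists` and the file name `example.sqlite3.db` are hypothetical, and the demonstration runs on a throwaway file rather than the real graph cache.

```python
import os
import tempfile


def remove_if_exists(path):
    """Delete path if it is present; return True when a file was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


# Demonstrate on a temporary file, not on the real $STORE.
with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "example.sqlite3.db")  # hypothetical file name
    open(store, "w").close()
    first = remove_if_exists(store)   # the file existed
    second = remove_if_exists(store)  # the file is already gone
    print(first, second)
```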


# Utilities

In [7]:
def bar_chart(data, x_column, y_column, title="", width=800):
    """Construct a simple bar chart with two properties"""
    bars = alt.Chart(data).mark_bar().encode(
        y=alt.Y(y_column, sort='-x'),
        x=x_column
    ).properties(
        title=title,
        width=width
    )

    text = bars.mark_text(
        align='left',
        baseline='middle',
        dx=3  # Nudges text to right so it doesn't appear on top of the bar
    ).encode(
        text=x_column
    )

    return (bars + text)

In [8]:
import io
import pandas
import subprocess

def shell_df(command, shell=False, **kwargs):
    """
    Takes a shell command as a string and reads the result into a Pandas DataFrame.
    
    Additional keyword arguments are passed through to pandas.read_csv.
    
    :param command: a shell command that returns tabular data
    :type command: str
    :param shell: passed to subprocess.Popen
    :type shell: bool
    
    :return: a pandas dataframe
    :rtype: :class:`pandas.DataFrame`
    """
    proc = subprocess.Popen(command, 
                            shell=shell,
                            stdout=subprocess.PIPE, 
                            stderr=subprocess.PIPE)
    output, error = proc.communicate()
    
    if proc.returncode == 0:
        if error:
            print(error.decode())
        with io.StringIO(output.decode()) as buffer:
            return pandas.read_csv(buffer, **kwargs)
    else:
        message = ("Shell command returned non-zero exit status: {0}\n\n"
                   "Command was:\n{1}\n\n"
                   "Standard error was:\n{2}")
        raise IOError(message.format(proc.returncode, command, error.decode()))
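
One detail to keep in mind when parsing tab-separated KGTK output: string values carry literal double quotes (for example `"CN"`), and default CSV parsing removes them. The sketch below shows the difference using only the standard library. The sample data is made up for illustration.

```python
import csv
import io

# A header row followed by one row holding a quoted KGTK string.
sample = 'source\tcount\n"CN"\t3\n'

# Default handling treats the double quotes as CSV quoting and drops them.
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE keeps the quotes as part of the value.
raw_rows = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(default_rows[1])
print(raw_rows[1])
```

Because `shell_df` forwards keyword arguments to `pandas.read_csv`, the same choice can be made there through that function's `quoting` parameter.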

# Poking around

Print some lines to see what we have

In [9]:
!zcat < "$KG" | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                                         node1                    relation        node2                             node1;label          node2;label                   relation;label  relation;dimension  source                                                  sentence
/c/en/0-/r/DefinedAs-/c/en/empty_set-0000                  /c/en/0                  /r/DefinedAs    /c/en/empty_set                   "0"                  "empty set"                   "defined as"    "CN"                "[[0]] is the [[empty set]]."
/c/en/0-/r/DefinedAs-/c/en/first_limit_ordinal-0000        /c/en/0                  /r/DefinedAs    /c/en/first_limit_ordinal         "0"                  "first limit ordinal"         "defined as"    "CN"                "[[0]] is the [[first limit ordinal]]."
/c/en/0-/r/DefinedAs-/c/en/number_zero-0000                /c/en/0                  /r/DefinedAs    /c/en/number_zero                 "0"                  "num
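
The `Broken pipe` message appears because `head` exits while `zcat` is still writing. The same peek can be done in Python without a pipe. This is a sketch: the helper `head_tsv` and the demo file are hypothetical, and the real input would be the path stored in `$KG`.

```python
import gzip
import itertools
import os
import tempfile


def head_tsv(path, n=10):
    """Return the first n lines of a gzip-compressed TSV as lists of fields."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [
            line.rstrip("\n").split("\t")
            for line in itertools.islice(handle, n)
        ]


# Demonstrate on a tiny made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.kgtk.gz")  # hypothetical file
    with gzip.open(demo, "wt", encoding="utf-8") as out:
        out.write("id\tnode1\trelation\tnode2\n")
        out.write("e1\t/c/en/0\t/r/DefinedAs\t/c/en/empty_set\n")
    rows = head_tsv(demo, n=1)
    print(rows)
```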

Normalize the file so that it is easier to process with Kypher

In [10]:
!kgtk normalize --verbose -i $KG -o $CSKG/temp.cskg.normalize.1.kgtk.gz --columns-to-lower 'relation;dimension' source sentence 'node1;label' 'relation;label' 'node2;label'

Opening the input file: /Users/pedroszekely/Downloads/kypher/cskg/cskg_connected.kgtk.gz
KgtkReader: File_path.suffix: .gz
KgtkReader: reading gzip /Users/pedroszekely/Downloads/kypher/cskg/cskg_connected.kgtk.gz
header: id	node1	relation	node2	node1;label	node2;label	relation;label	relation;dimension	source	sentence
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Node1 column name: node1
Label column name: relation
Node2 column name: node2
Id column name: id
The following columns will be lowered or normalized
 node1;label from node1 (label 'label')
 node2;label from node2 (label 'label')
 relation;label from relation (label 'label')
 relation;dimension from relation (label 'dimension')
 source from id (label 'source')
 sentence from id (label 'sentence')
The output columns are: id node1 relation node2
Opening the output file: /Users/pedroszekely/Downloads/kypher/cskg/temp.cskg.normalize.1.kgtk.gz
File_path.suffix: .gz
KgtkWriter: writing gzi

In [11]:
!zcat < $CSKG/temp.cskg.normalize.1.kgtk.gz | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                                   node1     relation                                 node2
/c/en/0-/r/DefinedAs-/c/en/empty_set-0000            /c/en/0   /r/DefinedAs                             /c/en/empty_set
/c/en/0-/r/DefinedAs-/c/en/empty_set-0000            source    "CN"
/c/en/0-/r/DefinedAs-/c/en/empty_set-0000            sentence  "[[0]] is the [[empty set]]."
/c/en/0                                              label     "0"
/r/DefinedAs                                         label     "defined as"
/c/en/empty_set                                      label     "empty set"
/c/en/0-/r/DefinedAs-/c/en/first_limit_ordinal-0000  /c/en/0   /r/DefinedAs                             /c/en/first_limit_ordinal
/c/en/0-/r/DefinedAs-/c/en/first_limit_ordinal-0000  source    "CN"
/c/en/0-/r/DefinedAs-/c/en/first_limit_ordinal-0000  sentence  "[[0]] is the [[first limit ordinal]]."
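
The listing above suggests how normalization reshapes each wide row: the core edge is kept, and every extra column becomes its own edge attached to the edge id, node, or relation it describes. The sketch below is inferred from the log and output shown here and is not KGTK's implementation. The function `normalize_row` is hypothetical. Skipping empty cells and leaving the id of the new edges blank are assumptions.

```python
# Extra column -> (core column that owns the value, relation name of the new edge),
# in the order given to --columns-to-lower.
LIFT = {
    "relation;dimension": ("relation", "dimension"),
    "source": ("id", "source"),
    "sentence": ("id", "sentence"),
    "node1;label": ("node1", "label"),
    "relation;label": ("relation", "label"),
    "node2;label": ("node2", "label"),
}


def normalize_row(row, lift=LIFT):
    """Turn one wide row into a list of (id, node1, relation, node2) edges."""
    edges = [(row["id"], row["node1"], row["relation"], row["node2"])]
    for column, (owner, relation) in lift.items():
        value = row.get(column, "")
        if value:  # assumption: empty cells produce no edge
            edges.append(("", row[owner], relation, value))
    return edges


row = {
    "id": "/c/en/0-/r/DefinedAs-/c/en/empty_set-0000",
    "node1": "/c/en/0",
    "relation": "/r/DefinedAs",
    "node2": "/c/en/empty_set",
    "node1;label": '"0"',
    "node2;label": '"empty set"',
    "relation;label": '"defined as"',
    "relation;dimension": "",
    "source": '"CN"',
    "sentence": '"[[0]] is the [[empty set]]."',
}
edges = normalize_row(row)
for edge in edges:
    print(edge)
```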


Rename the columns to the standard `node1/label/node2` and add ids

In [12]:
!kgtk rename-columns --mode NONE -i $CSKG/temp.cskg.normalize.1.kgtk.gz --output-columns id node1 label node2 \
/ add-id --id-style node1-label-node2 -o $NKG

Count the number of edges and nodes

In [13]:
!$kypher -i $NKG \
--match '(n1)-[e]->(n2)' \
--return 'count(e) as num_edges, count(distinct n1) as num_nodes, count(distinct e.label) as num_relations, count(distinct n2) as num_values' \
| column -t -s $'\t' 

      220.69 real       269.23 user        25.36 sys
num_edges  num_nodes  num_relations  num_values
15120425   8164285    84             3285154
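
To make the four columns concrete, the sketch below computes the same quantities over a few in-memory triples. The function `summarize` and the sample edges are illustrative only. Note that `num_nodes` counts distinct `node1` values, so a node that appears only as `node2` is not included. Since the query runs over the normalized file, edges such as `label`, `source`, and `sentence` appear to be included in the totals.

```python
def summarize(edges):
    """Count edges and distinct node1, label, and node2 values."""
    edges = list(edges)
    return {
        "num_edges": len(edges),
        "num_nodes": len({node1 for node1, _, _ in edges}),
        "num_relations": len({label for _, label, _ in edges}),
        "num_values": len({node2 for _, _, node2 in edges}),
    }


sample_edges = [
    ("/c/en/0", "/r/DefinedAs", "/c/en/empty_set"),
    ("/c/en/0", "/r/DefinedAs", "/c/en/first_limit_ordinal"),
    ("/c/en/0", "label", '"0"'),
]
summary = summarize(sample_edges)
print(summary)
```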


# Some Statistics

In [14]:
command = "$kypher -i $NKG \
--match '(n1)-[e]->(n2)' \
--return 'distinct e.label, count(distinct n1) as nodes' \
--order-by 'count(distinct n1) desc'"
stats = shell_df(command, shell=True, sep='\t')

       29.24 real        22.94 user         5.21 sys



In [15]:
bar_chart(stats[:20], 'nodes', 'label', title="Relations in CSKG")

### Distribution of edges in each data source

In [16]:
command = "$kypher -i $NKG \
--match '(n1)-[r]->(n2), (r)-[l:source]->(s)' \
--return 's as source, count(distinct r) as `count of relations`' \
--order-by 'count(distinct r) desc'"
data = shell_df(command, shell=True, sep='\t')
bar_chart(data, 'count of relations', 'source', title="CSKG Relation Counts")

       63.68 real        50.39 user        10.79 sys



Compute the distribution of relations in each data source

In [17]:
command = "$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source:`\"SOURCE\"`)' \
--return 'label as relation, count(distinct n1) as count' \
--order-by 'count(distinct n1) desc'"
datasets = []
for source in ["CN", "WN", "AT", "VG", "FN", "WD", "RG"]:
    data = shell_df(command.replace("SOURCE", source), shell=True, sep='\t')
    datasets.append(data)

       33.97 real        25.07 user         5.16 sys

        1.79 real         1.18 user         0.57 sys

        5.39 real         2.51 user         1.05 sys

        2.46 real         1.21 user         0.47 sys

        0.88 real         0.59 user         0.15 sys

        1.37 real         0.81 user         0.21 sys

        6.86 real         4.12 user         1.23 sys



In [18]:
bar_chart(datasets[0], 'count', 'relation', title="ConceptNet: Count Of Relations", width=200)

In [19]:
bar_chart(datasets[1], 'count', 'relation', title="WordNet: Count Of Relations", width=200)

In [20]:
bar_chart(datasets[2], 'count', 'relation', title="Atomic: Count Of Relations", width=200)

In [21]:
bar_chart(datasets[3], 'count', 'relation', title="Visual Genome: Count Of Relations", width=200)

In [22]:
bar_chart(datasets[4], 'count', 'relation', title="FrameNet: Count Of Relations", width=200)

In [23]:
bar_chart(datasets[5], 'count', 'relation', title="Wikidata: Count Of Relations", width=200)

In [24]:
bar_chart(datasets[6], 'count', 'relation', title="Roget: Count Of Relations", width=200)
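
The seven cells above differ only in the source code and the chart title, so the pairing can be kept in one table. This is a sketch: `SOURCE_NAMES` and `chart_specs` are hypothetical names, the code-to-name mapping is read off the chart titles above, and the template is a shortened stand-in for the full Kypher command.

```python
SOURCE_NAMES = {
    "CN": "ConceptNet",
    "WN": "WordNet",
    "AT": "Atomic",
    "VG": "Visual Genome",
    "FN": "FrameNet",
    "WD": "Wikidata",
    "RG": "Roget",
}


def chart_specs(template, names=SOURCE_NAMES):
    """Yield (query, title) pairs, one per source, in dictionary order."""
    for code, name in names.items():
        yield template.replace("SOURCE", code), f"{name}: Count Of Relations"


# Shortened stand-in for the full command string used in the notebook.
template = '(r)-[:source]->(source:`"SOURCE"`)'
specs = list(chart_specs(template))
for query, title in specs:
    print(title, "|", query)
```

In the notebook, each query would go to `shell_df` and each title to `bar_chart`, which removes the need to keep list indices and titles in sync by hand.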

### ConceptNet nodes that contain `catch` and `throw`

In [32]:
catch = !$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \
--where 'source = $s and n1 =~ ".*/catch/.*"' \
--return 'distinct n1 as node1' \
--spara s='CN' \
--limit 100

for n in catch[1:-1]:
    print("https://www.conceptnet.io"+n)

https://www.conceptnet.io/c/en/catch/n
https://www.conceptnet.io/c/en/catch/n/wn/act
https://www.conceptnet.io/c/en/catch/n/wn/artifact
https://www.conceptnet.io/c/en/catch/n/wn/attribute
https://www.conceptnet.io/c/en/catch/n/wn/object
https://www.conceptnet.io/c/en/catch/n/wn/person
https://www.conceptnet.io/c/en/catch/n/wn/quantity
https://www.conceptnet.io/c/en/catch/v
https://www.conceptnet.io/c/en/catch/v/wn/body
https://www.conceptnet.io/c/en/catch/v/wn/competition
https://www.conceptnet.io/c/en/catch/v/wn/emotion
https://www.conceptnet.io/c/en/catch/v/wn/motion
https://www.conceptnet.io/c/en/catch/v/wn/perception
https://www.conceptnet.io/c/en/catch/v/wn/possession
https://www.conceptnet.io/c/en/catch/v/wn/social


In [26]:
throw = !$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \
--where 'source in [$cn] and n1 =~ ".*/throw?\\b.*"' \
--return 'distinct n1 as node1' \
--spara cn='CN' --spara wd='WD' --spara vg='VG' \
--limit 100

for n in throw[1:-1]:
    print("https://www.conceptnet.io"+n)

https://www.conceptnet.io/c/en/thro
https://www.conceptnet.io/c/en/thro/a/wikt/en_2
https://www.conceptnet.io/c/en/throw
https://www.conceptnet.io/c/en/throw/n
https://www.conceptnet.io/c/en/throw/n/wikt/en_1
https://www.conceptnet.io/c/en/throw/n/wikt/en_2
https://www.conceptnet.io/c/en/throw/n/wikt/en_3
https://www.conceptnet.io/c/en/throw/n/wikt/en_4
https://www.conceptnet.io/c/en/throw/n/wn/artifact
https://www.conceptnet.io/c/en/throw/n/wn/event
https://www.conceptnet.io/c/en/throw/n/wn/state
https://www.conceptnet.io/c/en/throw/n/wp/grappling
https://www.conceptnet.io/c/en/throw/v
https://www.conceptnet.io/c/en/throw/v/wikt/en_1
https://www.conceptnet.io/c/en/throw/v/wikt/en_2
https://www.conceptnet.io/c/en/throw/v/wn/cognition
https://www.conceptnet.io/c/en/throw/v/wn/communication
https://www.conceptnet.io/c/en/throw/v/wn/emotion
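
The results include `/c/en/thro` because `throw?` makes the final `w` optional. It does not add an optional plural `s`, which is what the `dogs?` and `frisbees?` patterns do. The sketch below reproduces the effect with Python's `re` module, assuming Kypher's `=~` behaves like a whole-string Python regex match. The candidate list and the helper `matching` are illustrative.

```python
import re

candidates = [
    "/c/en/thro",
    "/c/en/throw",
    "/c/en/throw/n",
    "/c/en/throws",
    "/c/en/throne",
]


def matching(pattern, nodes):
    """Return the nodes that the pattern matches in full."""
    regex = re.compile(pattern)
    return [node for node in nodes if regex.fullmatch(node)]


as_written = matching(r".*/throw?\b.*", candidates)    # optional "w"
plural_form = matching(r".*/throws?\b.*", candidates)  # optional "s"

print(as_written)
print(plural_form)
```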


### ConceptNet nodes that contain `dog`

In [27]:
dogs = !$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \
--where 'source in [$cn] and n1 =~ ".*/dogs?\\b.*"' \
--return 'distinct n1 as node1' \
--spara cn='CN' \
--limit 100

for n in dogs[1:-1]:
    print("https://www.conceptnet.io"+n)

https://www.conceptnet.io/c/en/boxer/n/wp/dog
https://www.conceptnet.io/c/en/coat/n/wp/dog
https://www.conceptnet.io/c/en/dog
https://www.conceptnet.io/c/en/dog's/n
https://www.conceptnet.io/c/en/dog's_abuse/n
https://www.conceptnet.io/c/en/dog's_acoustical_sense
https://www.conceptnet.io/c/en/dog's_age/n
https://www.conceptnet.io/c/en/dog's_bark
https://www.conceptnet.io/c/en/dog's_bollocks
https://www.conceptnet.io/c/en/dog's_bollocks/n
https://www.conceptnet.io/c/en/dog's_breakfast
https://www.conceptnet.io/c/en/dog's_breakfast/n
https://www.conceptnet.io/c/en/dog's_breakfast/n/wn/state
https://www.conceptnet.io/c/en/dog's_breakfasts/n
https://www.conceptnet.io/c/en/dog's_chance
https://www.conceptnet.io/c/en/dog's_chance/n
https://www.conceptnet.io/c/en/dog's_cherries/n
https://www.conceptnet.io/c/en/dog's_cherry
https://www.conceptnet.io/c/en/dog's_dinner/n
https://www.conceptnet.io/c/en/dog's_dinner/n/wn/state
https://www.conceptnet.io/c/en/dog's_ear/v
https://www.conceptnet.io/c

In [28]:
frisbee = !$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source)' \
--where 'source in [$cn] and n1 =~ ".*/frisbees?\\b.*"' \
--return 'distinct n1 as node1' \
--spara cn='CN' --spara wd='WD' --spara vg='VG' \
--limit 100

for n in frisbee[1:-1]:
    print("https://www.conceptnet.io"+n)

https://www.conceptnet.io/c/en/frisbee
https://www.conceptnet.io/c/en/frisbee/n
https://www.conceptnet.io/c/en/frisbee/v
https://www.conceptnet.io/c/en/frisbees
https://www.conceptnet.io/c/en/frisbees/n
https://www.conceptnet.io/c/en/frisbees/v


### Get all the nodes in ConceptNet, Wikidata-CS and Visual Genome that contain `frisbee`

In [29]:
!$kypher -i $NKG \
--match '(n1)-[r {label: label}]->(n2), (r)-[:source]->(source), (n1)-[:label]->(n1_label)' \
--where 'source in [$cn, $vg, $wd] and n1 =~ ".*frisbees?.*"' \
--return 'distinct source as source, n1 as node1, n1_label as `node1 label`, label as relation, n2 as node2' \
--order-by 'source, n1' \
--spara cn='CN' --spara wd='WD' --spara vg='VG' \
-o $CSKG/frisbee.tsv

      279.25 real       238.73 user        12.22 sys


We have many edges that relate to `frisbee`

In [30]:
!wc $CSKG/frisbee.tsv

   12828   73835  953300 /Users/pedroszekely/Downloads/kypher/cskg/frisbee.tsv
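
A line count alone does not show how those edges are spread across sources and relations. The sketch below tallies `(source, relation)` pairs over a few rows shaped like those in `frisbee.tsv`. The sample is a small hand-picked excerpt, so the counts it prints say nothing about the real file.

```python
import collections

# Column names follow the --return clause of the query above.
sample = (
    "source\tnode1\tnode1 label\trelation\tnode2\n"
    '"CN"\t/c/en/capturing_frisbee\t"capturing frisbee"\t/r/HasPrerequisite\t/c/en/coordination\n'
    '"CN"\t/c/en/frisbee\t"frisbee"\t/r/AtLocation\t/c/en/air\n'
    '"CN"\t/c/en/frisbee\t"frisbee"\t/r/AtLocation\t/c/en/park\n'
)

lines = sample.splitlines()
header = lines[0].split("\t")
rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]

# Tally edges per (source, relation) pair.
counts = collections.Counter((row["source"], row["relation"]) for row in rows)
for (source, relation), count in counts.most_common():
    print(source, relation, count)
```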


In [31]:
!head $CSKG/frisbee.tsv | column -t -s $'\t' 

source  node1                         node1 label               relation            node2
"CN"    /c/en/capturing_frisbee       "capturing frisbee"       /r/HasPrerequisite  /c/en/coordination
"CN"    /c/en/dogs_catching_frisbees  "dogs catching frisbees"  /r/AtLocation       /c/en/park
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/air
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/deadhead's_van
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/frisbee_golf_course
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/park
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/roof
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/toy_chest
"CN"    /c/en/frisbee                 "frisbee"                 /r/AtLocation       /c/en/tree
