# Class Visualization



### Preamble: set up the environment and files used in the tutorial

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

from graph_tool.all import *

import papermill as pm

from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [2]:
# Parameters

kgtk_path = "/Users/pedroszekely/Documents/GitHub/kgtk"

# Folder on local machine where to create the output and temporary folders
input_path = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2/"
output_path = "/Users/pedroszekely/Downloads/kypher/projects"
graph_cache_path = "/Users/pedroszekely/Downloads/kypher/class-visualization.sqlite3.db"
project_name = "class-visualization"

Our Wikidata distribution partitions Wikidata into smaller files so that you can pick and choose the ones you need. Our tutorial KG is a subset of Wikidata and is partitioned in the same way as the full Wikidata. This notebook uses the following files, a partial list of what is available:

In [3]:
files = [
    "p279",
    "p279star",
    "label"
]

# statistics.Pinstance_count.tsv.gz

ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  graph_cache_path=graph_cache_path,
                  project_name=project_name,
                  debug=True
                 )

User home: /Users/pedroszekely
Current dir: /Users/pedroszekely/Documents/GitHub/kgtk-tutorial-files/use-cases
KGTK dir: /Users/pedroszekely/Documents/GitHub/kgtk
Use-cases dir: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases


The KGTK setup command defines environment variables for all the files so that you can reuse the Jupyter notebook when you install it on your local machine.

In [4]:
ck.print_env_variables()

USE_CASES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases
kypher: kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/class-visualization.sqlite3.db
GRAPH: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2/
kgtk: kgtk --debug
KGTK_GRAPH_CACHE: /Users/pedroszekely/Downloads/kypher/class-visualization.sqlite3.db
EXAMPLES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/examples
STORE: /Users/pedroszekely/Downloads/kypher/class-visualization.sqlite3.db
KGTK_OPTION_DEBUG: false
OUT: /Users/pedroszekely/Downloads/kypher/projects/class-visualization
TEMP: /Users/pedroszekely/Downloads/kypher/projects/class-visualization/temp.class-visualization
KGTK_LABEL_FILE: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//labels.en.tsv.gz
p279: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//derived.P279.tsv.gz
p279star: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//derived.P

In [5]:
ck.load_files_into_cache()

kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/class-visualization.sqlite3.db -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//derived.P279.tsv.gz" --as p279  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//derived.P279star.tsv.gz" --as p279star  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215-dwd-v2//labels.en.tsv.gz" --as label  --limit 3
[2021-12-28 21:43:20 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [3]
---------------------------------------------
id	node1	label	node2
Q100000030-P279-Q14748-30394205-0	Q100000030	P279	Q14748
Q100000058-P279-Q1622444-bd182663-0	Q100000058	P279	Q1622444
Q1000032-P279-Q1813494-0aa0f1dc-0	Q1000032	P279	Q1813494


In [6]:
!kgtk --debug query -i p279 --idx mode:monograph --limit 5

[2021-12-28 21:43:22 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [5]
---------------------------------------------
id	node1	label	node2
Q100000030-P279-Q14748-30394205-0	Q100000030	P279	Q14748
Q100000058-P279-Q1622444-bd182663-0	Q100000058	P279	Q1622444
Q1000032-P279-Q1813494-0aa0f1dc-0	Q1000032	P279	Q1813494
Q1000032-P279-Q83602-482a1943-0	Q1000032	P279	Q83602
Q1000039-P279-Q11555767-2dddfd86-0	Q1000039	P279	Q11555767


In [7]:
!kgtk --debug query -i p279star --idx mode:monograph --limit 5

[2021-12-28 21:43:25 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_2 AS graph_2_c1
     LIMIT ?
  PARAS: [5]
---------------------------------------------
node1	label	node2	id
Q100000030	P279star	Q100000030	Q100000030-P279star-Q100000030-0000
Q100000030	P279star	Q1357761	Q100000030-P279star-Q1357761-0000
Q100000030	P279star	Q14745	Q100000030-P279star-Q14745-0000
Q100000030	P279star	Q14748	Q100000030-P279star-Q14748-0000
Q100000030	P279star	Q15401930	Q100000030-P279star-Q15401930-0000


In [8]:
!kgtk --debug query -i label --idx mode:monograph --limit 5

[2021-12-28 21:43:27 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_3 AS graph_3_c1
     LIMIT ?
  PARAS: [5]
---------------------------------------------
id	node1	label	node2
P10-label-en	P10	label	'video'@en
P1000-label-en	P1000	label	'record held'@en
P1001-label-en	P1001	label	'applies to jurisdiction'@en
P1002-label-en	P1002	label	'engine configuration'@en
P1003-label-en	P1003	label	'National Library of Romania ID'@en


## Get a list of all the classes


First, get the list of all distinct `node1` values in `p279`, i.e., every class that is a subclass of some other class:

In [9]:
kgtk("""
    query -i p279
        --match '(class)-[]->()'
        --return 'distinct class as id'
    -o $TEMP/p279.node1.tsv.gz
""")

In [10]:
!zcat < $TEMP/p279.node1.tsv.gz | wc -l

 2493245


Now get the list of all distinct `node2` values in `p279`, i.e., every class that has at least one subclass:

In [11]:
kgtk("""
    query -i p279
        --match '()-[]->(class)'
        --return 'distinct class as id'
    -o $TEMP/p279.node2.tsv.gz
""")

In [12]:
!zcat < $TEMP/p279.node2.tsv.gz | wc -l

  126327

Keep only the `node2` classes that never appear as a `node1`. These are the classes that are not a subclass of anything:


In [13]:
kgtk("""
    ifnotexists --mode NONE 
        -i $TEMP/p279.node2.tsv.gz
        --filter-on $TEMP/p279.node1.tsv.gz
        --input-keys id
        --filter-keys id
    -o $TEMP/p279.classes-that-are-not-subclasses.tsv.gz
""")

In [14]:
!zcat < $TEMP/p279.classes-that-are-not-subclasses.tsv.gz | wc -l

   10700


Concatenate the two files and sort by `id` to get the list of all the classes:

In [15]:
kgtk("""
    cat --mode NONE -i $TEMP/p279.node1.tsv.gz -i $TEMP/p279.classes-that-are-not-subclasses.tsv.gz
    / sort --mode NONE --column id
    -o $OUT/classes.tsv.gz
""")

In [16]:
!zcat < $OUT/classes.tsv.gz | wc -l

 2503944
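
The steps above amount to a small set computation: the classes are every `node1`, plus every `node2` that never shows up as a `node1`. The following is a minimal Python sketch on a toy edge list. The edges and variable names are made up for illustration; the notebook itself does this with `kgtk query`, `ifnotexists`, `cat` and `sort`.

```python
# Toy P279 edges as (subclass, superclass) pairs; the ids are placeholders.
edges = [
    ("Q3", "Q2"),
    ("Q2", "Q1"),
    ("Q4", "Q1"),
    ("Q5", "Q4"),
]

node1 = {sub for sub, _ in edges}         # classes that are a subclass of something
node2 = {sup for _, sup in edges}         # classes that have at least one subclass
roots_only = node2 - node1                # anti-join: the role played by ifnotexists
all_classes = sorted(node1 | roots_only)  # concatenate, then sort by id

print(roots_only)
print(all_classes)
```

Because the two inputs to the union are disjoint by construction, a plain concatenation is enough and no deduplication is needed.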


## Measure the degree of classes

In [17]:
kgtk("""
    graph-statistics -i "$p279" -o $OUT/statistics.p279.tsv.gz 
    --compute-pagerank False 
    --compute-hits False 
    --page-rank-property Pdirected_pagerank 
    --vertex-in-degree-property Pindegree
    --vertex-out-degree-property Poutdegree
    --output-degrees True 
    --output-pagerank False 
    --output-hits False
    --output-statistics-only 
    --undirected False 
    --log-file $TEMP/statistics.summary.txt
""")

In [18]:
kgtk("sort -i $OUT/statistics.p279.tsv.gz --columns node2 --numeric --reverse -o $TEMP/p279.indegree.tsv.gz")

In [19]:
kgtk("head -i $TEMP/p279.indegree.tsv.gz -n 25 / add-labels")

Unnamed: 0,node1,label,node2,id,node1;label
0,Q20747295,Pindegree,942004,Q20747295-Pindegree-19626,'protein-coding gene'@en
1,Q8054,Pindegree,764038,Q8054-Pindegree-15274,'protein'@en
2,Q7187,Pindegree,449619,Q7187-Pindegree-5566,'gene'@en
3,Q277338,Pindegree,49936,Q277338-Pindegree-220748,'pseudogene'@en
4,Q427087,Pindegree,47843,Q427087-Pindegree-197396,'non-coding RNA'@en
5,Q382617,Pindegree,40184,Q382617-Pindegree-45664,'mayor of a place in France'@en
6,Q15113603,Pindegree,40179,Q15113603-Pindegree-197900,'municipal councillor'@en
7,Q11173,Pindegree,14255,Q11173-Pindegree-638,'chemical compound'@en
8,Q64698614,Pindegree,8832,Q64698614-Pindegree-2767278,'pseudogenic transcript'@en
9,Q201448,Pindegree,8724,Q201448-Pindegree-278588,'transfer RNA'@en


In [20]:
kgtk("""
    query -i $OUT/statistics.p279.tsv.gz 
        --match '(n1)-[eid]->(degree)' 
        --where 'cast(degree, int) > 500' 
        --order-by 'cast(degree, int) desc'
""")

Unnamed: 0,node1,label,node2,id
0,Q20747295,Pindegree,942004,Q20747295-Pindegree-19626
1,Q8054,Pindegree,764038,Q8054-Pindegree-15274
2,Q7187,Pindegree,449619,Q7187-Pindegree-5566
3,Q277338,Pindegree,49936,Q277338-Pindegree-220748
4,Q427087,Pindegree,47843,Q427087-Pindegree-197396
...,...,...,...,...
63,Q1183543,Pindegree,517,Q1183543-Pindegree-330
64,Q7368,Pindegree,516,Q7368-Pindegree-2150
65,Q11415564,Pindegree,512,Q11415564-Pindegree-656
66,Q87008012,Pindegree,501,Q87008012-Pindegree-5226


### Create list of high and low `P279` degree classes 

In [182]:
kgtk("""
    query -i $OUT/statistics.p279.tsv.gz 
        --match '(n1)-[:Pindegree]->(degree)' 
        --where 'cast(degree, int) < 500' 
        --return 'n1 as node1, "few_subclasses" as node_type'
        --order-by 'cast(degree, int) desc'
    -o $OUT/class-browsing.low-degree-nodes.tsv
""")

The `class-browsing.low-degree-nodes.tsv` file is simply a list of nodes:

In [183]:
kgtk("head -n 5 -i $OUT/class-browsing.low-degree-nodes.tsv")

Unnamed: 0,node1
0,Q898273
1,Q1002954
2,Q11446
3,Q22325163
4,Q79529


In [184]:
kgtk("""
    query -i $OUT/statistics.p279.tsv.gz 
        --match '(n1)-[:Pindegree]->(degree)' 
        --where 'cast(degree, int) > 499'
        --return 'n1 as node1, "many_subclasses" as node_type'
        --order-by 'cast(degree, int) desc'
    -o $OUT/class-browsing.high-degree-nodes.tsv
""")

In [189]:
kgtk("""
    cat --use-graph-cache-envar False --mode NONE -i $OUT/class-browsing.low-degree-nodes.tsv -i $OUT/class-browsing.high-degree-nodes.tsv
    -o $OUT/class-browsing.all-nodes.tsv
""")

In [190]:
kgtk("head -i $OUT/class-browsing.all-nodes.tsv -n 4")

Unnamed: 0,node1,node_type
0,Q898273,few_subclasses
1,Q1002954,few_subclasses
2,Q11446,few_subclasses
3,Q22325163,few_subclasses
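
The split into `few_subclasses` and `many_subclasses` can be pictured as counting incoming `P279` edges per class and comparing the count with the cut-off of 500 used in the queries above. This is only a sketch: the function name, the toy edges and the small threshold in the usage lines are hypothetical, and it assumes every class gets a degree, including leaves with an in-degree of 0.

```python
from collections import Counter

THRESHOLD = 500  # same cut-off as the queries above


def partition_by_indegree(edges, threshold=THRESHOLD):
    """Map each class to a node_type based on its number of incoming P279 edges."""
    indegree = Counter()
    for subclass, superclass in edges:
        indegree[superclass] += 1
        indegree.setdefault(subclass, 0)  # leaves still get an entry, with degree 0
    return {
        node: "few_subclasses" if degree < threshold else "many_subclasses"
        for node, degree in indegree.items()
    }


# Toy data: HUB has three direct subclasses, so a threshold of 3 marks it as high degree.
edges = [("A", "HUB"), ("B", "HUB"), ("C", "HUB"), ("D", "A")]
types = partition_by_indegree(edges, threshold=3)
print(types)
```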


In [191]:
!kgtk --debug query -i $OUT/class-browsing.all-nodes.tsv --as browsernodes --idx index:node1,node_type --limit 3

[2021-12-30 19:18:36 sqlstore]: IMPORT graph directly into table graph_43 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/class-browsing.all-nodes.tsv ...
[2021-12-30 19:18:42 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_43 AS graph_43_c1
     LIMIT ?
  PARAS: [3]
---------------------------------------------
[2021-12-30 19:18:42 sqlstore]: CREATE INDEX "graph_43_node1_node_type_idx" ON "graph_43" ("node1", "node_type")
[2021-12-30 19:18:43 sqlstore]: ANALYZE "graph_43_node1_node_type_idx"
node1	node_type
Q898273	few_subclasses
Q1002954	few_subclasses
Q11446	few_subclasses


## Create a P279star file that we will use for visualization.



### First create a complete p279star file containing all classes

First create a complete P279star file that contains all classes as our starting point. We do this because in the browser, users can click on any class.

In [24]:
kgtk("""
    reachable-nodes
        --rootfile $OUT/classes.tsv.gz
        --selflink 
        --breadth-first True
        --show-distance True
        --label P279star
        -i "$p279"
        -o $TEMP/derived.p279star.complete.tsv.gz
""")

In [25]:
kgtk("head -i $TEMP/derived.p279star.complete.tsv.gz -n 10")

Unnamed: 0,node1,label,node2,distance
0,Q100000030,P279star,Q100000030,0
1,Q100000030,P279star,Q14748,1
2,Q100000030,P279star,Q14745,2
3,Q100000030,P279star,Q1357761,3
4,Q100000030,P279star,Q2424752,3
5,Q100000030,P279star,Q31807746,3
6,Q100000030,P279star,Q8205328,3
7,Q100000030,P279star,Q223557,4
8,Q100000030,P279star,Q15401930,4
9,Q100000030,P279star,Q28877,4
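
To make the shape of this output concrete, here is a minimal breadth-first sketch that produces the same kind of rows: one row per (root, reachable superclass) pair with its distance, including the distance-0 self link. It is an illustration on toy data, not the `reachable-nodes` implementation, and the function name is hypothetical.

```python
from collections import defaultdict, deque


def reachable_with_distance(edges, roots):
    """Yield (root, reached, distance) rows, starting with the distance-0 self link."""
    parents = defaultdict(list)
    for subclass, superclass in edges:
        parents[subclass].append(superclass)

    for root in roots:
        distance = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            yield root, node, distance[node]
            for parent in parents.get(node, []):
                if parent not in distance:  # first visit is the shortest path in BFS
                    distance[parent] = distance[node] + 1
                    queue.append(parent)


# Toy hierarchy: Q3 is a subclass of both Q2 and Q1, and Q2 is a subclass of Q1.
edges = [("Q3", "Q2"), ("Q2", "Q1"), ("Q3", "Q1")]
rows = list(reachable_with_distance(edges, ["Q3"]))
print(rows)
```

Breadth-first traversal matters here because it guarantees that the recorded distance is the length of the shortest `P279` path, which is what the `distance` column reports.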


The complete p279star file has only slightly more edges than the default one (roughly 9,700 more, per the line counts below). We should replace the original one with the complete one in any case.

In [26]:
!zcat < "$p279star" | wc -l

 87773437


In [27]:
!zcat < $TEMP/derived.p279star.complete.tsv.gz | wc -l

 87783113


Add ids and index the file for use in queries. The new file has a `distance` column, which we index too so that queries that filter on distance run quickly.

In [28]:
kgtk("""
    add-id --id-style wikidata -i $TEMP/derived.p279star.complete.tsv.gz
    -o $OUT/derived.p279star.complete.tsv.gz
""")

In [29]:
!kgtk --debug query -i $OUT/derived.p279star.complete.tsv.gz --as p279stard --idx index:node2,node1,distance --limit 3

[2021-12-28 23:14:45 sqlstore]: DROP graph data table graph_5 from p279stard
[2021-12-28 23:16:30 sqlstore]: IMPORT graph directly into table graph_28 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/derived.p279star.complete.tsv.gz ...
[2021-12-28 23:22:35 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_28 AS graph_28_c1
     LIMIT ?
  PARAS: [3]
---------------------------------------------
[2021-12-28 23:22:35 sqlstore]: CREATE INDEX "graph_28_node2_node1_distance_idx" ON "graph_28" ("node2", "node1", "distance")
[2021-12-28 23:25:02 sqlstore]: ANALYZE "graph_28_node2_node1_distance_idx"
node1	label	node2	distance	id
Q100000030	P279star	Q100000030	0	Q100000030-P279star-Q100000030
Q100000030	P279star	Q14748	1	Q100000030-P279star-Q14748
Q100000030	P279star	Q14745	2	Q100000030-P279star-Q14745


### Count the number of subclasses 
We eventually want to build the subclass graph for each class, but some graphs may be too large.

In [30]:
kgtk("""
    query -i p279stard
        --match '
            (subclass)-[]->(class)'
        --return 'class as node1, "Pcount_subclasses" as label, count(distinct subclass) as node2, class as graph'
        --where 'subclass != class'
        --order-by 'cast(node2, int) desc'
    -o $TEMP/subclass.count.tsv.gz
""")

Get an overview of the file. The top classes have an enormous number of subclasses, which will cause trouble for visualization.
Also, only about 126K classes have subclasses, so the vast majority of classes in Wikidata are leaf classes.

In the steps below we exclude the high degree classes, but that alone won't fix the problem, as the top classes still have too many subclasses. The browser will freeze and the user will be annoyed.

In [31]:
df = kgtk("""
    cat -i $TEMP/subclass.count.tsv.gz / add-labels
""")
df

Unnamed: 0,node1,label,node2,graph,node1;label,graph;label
0,Q35120,Pcount_subclasses,2461204,Q35120,'entity'@en,'entity'@en
1,Q99527517,Pcount_subclasses,2254394,Q99527517,'collection entity'@en,'collection entity'@en
2,Q28813620,Pcount_subclasses,1362927,Q28813620,'set'@en,'set'@en
3,Q16887380,Pcount_subclasses,1362452,Q16887380,'group'@en,'group'@en
4,Q488383,Pcount_subclasses,1286223,Q488383,'object'@en,'object'@en
...,...,...,...,...,...,...
126319,Q99970237,Pcount_subclasses,1,Q99970237,'anthropomorphic deer'@en,'anthropomorphic deer'@en
126320,Q99971015,Pcount_subclasses,1,Q99971015,'anthropomorphic cow or other cattle'@en,'anthropomorphic cow or other cattle'@en
126321,Q99972330,Pcount_subclasses,1,Q99972330,'video game occupation'@en,'video game occupation'@en
126322,Q99974769,Pcount_subclasses,1,Q99974769,,
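
The counting query above groups the closure by `class` and counts the distinct subclasses, ignoring the self links. A small Python sketch of the same idea, on a toy closure with a hypothetical function name:

```python
from collections import defaultdict


def count_subclasses(closure):
    """closure: iterable of (subclass, class) pairs, self links allowed."""
    subclasses = defaultdict(set)
    for subclass, cls in closure:
        if subclass != cls:  # mirrors --where 'subclass != class'
            subclasses[cls].add(subclass)
    counts = {cls: len(members) for cls, members in subclasses.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Closure of the chain Q3 -> Q2 -> Q1, including self links.
closure = [
    ("Q1", "Q1"),
    ("Q2", "Q2"), ("Q2", "Q1"),
    ("Q3", "Q3"), ("Q3", "Q2"), ("Q3", "Q1"),
]
ranking = count_subclasses(closure)
print(ranking)
```

Leaf classes such as `Q3` never appear in the result, which is why the output file has far fewer rows than there are classes.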


### Create a subset of p279 that excludes high in-degree classes in node2

File `class-browsing.low-degree-nodes.tsv` lists the classes with a low number of direct subclasses (fewer than 500), which we call the low degree nodes. Our low degree P279 file will have all P279 edges that arrive at a low degree class, i.e., where the superclass is a low degree class.

In [32]:
kgtk("""
    query -i p279 -i $OUT/class-browsing.low-degree-nodes.tsv
        --match '
            p279: (class)-[eid]->(superclass),
            low: (superclass)'
        --return 'class as node1, eid.label as label, superclass as node2, eid as id'
    -o $OUT/p279.lowdegree.tsv.gz
""")

In [33]:
!zcat < "$p279" | wc -l

 3077832


The low degree P279 file has far fewer edges, which is expected as the high degree classes account for a lot of edges.

In [34]:
!zcat < $OUT/p279.lowdegree.tsv.gz | wc -l

  633444
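
The filter above is a semi-join: an edge survives only if its superclass is in the low degree list. A minimal sketch with made-up data and a hypothetical function name:

```python
def keep_edges_into_low_degree(edges, low_degree_nodes):
    """Keep the (subclass, superclass) edges whose superclass is a low degree class."""
    low = set(low_degree_nodes)
    return [(sub, sup) for sub, sup in edges if sup in low]


edges = [("A", "HUB"), ("B", "HUB"), ("C", "A")]
low_degree_nodes = ["A", "B", "C"]  # HUB is deliberately left out
kept = keep_edges_into_low_degree(edges, low_degree_nodes)
print(kept)
```

Note that only the superclass side is tested, so an edge from a high degree class up to a low degree class would still be kept.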


### Recompute P279star with the low degree classes
The output will be `derived.p279star.low-degree.complete.tsv.gz`

We start at all classes, and find all superclasses for them, excluding the high degree classes.

In [35]:
kgtk("""
    reachable-nodes
        --rootfile $OUT/classes.tsv.gz
        --selflink 
        --breadth-first True
        --show-distance True
        --label P279star
        -i $OUT/p279.lowdegree.tsv.gz
        -o $TEMP/derived.p279star.low-degree.complete.tsv.gz
""")

Add ids

In [36]:
kgtk("""
    add-id --id-style wikidata -i $TEMP/derived.p279star.low-degree.complete.tsv.gz
    -o $OUT/derived.p279star.low-degree.complete.tsv.gz
""")

Index using `node1`, `node2` and `distance`. Open question: should we also index the `id` column?

In [37]:
!kgtk --debug query -i $OUT/derived.p279star.low-degree.complete.tsv.gz --as p279starlow --idx index:node2,node1,distance --limit 3

[2021-12-28 23:38:22 sqlstore]: DROP graph data table graph_11 from p279starlow
[2021-12-28 23:38:57 sqlstore]: IMPORT graph directly into table graph_30 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/derived.p279star.low-degree.complete.tsv.gz ...
[2021-12-28 23:40:01 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_30 AS graph_30_c1
     LIMIT ?
  PARAS: [3]
---------------------------------------------
[2021-12-28 23:40:01 sqlstore]: CREATE INDEX "graph_30_node2_node1_distance_idx" ON "graph_30" ("node2", "node1", "distance")
[2021-12-28 23:40:24 sqlstore]: ANALYZE "graph_30_node2_node1_distance_idx"
node1	label	node2	distance	id
Q100000030	P279star	Q100000030	0	Q100000030-P279star-Q100000030
Q100000030	P279star	Q14748	1	Q100000030-P279star-Q14748
Q100000030	P279star	Q14745	2	Q100000030-P279star-Q14745


### Statistics to show in the graph

> We are not computing the statistics file in this notebook as it is computed in the `p1963` project. 
> We need the file here, so Pedro copied it from the `p1963` project and put it in the `$TEMP` folder

The file is `statistics.Pinstance_count.tsv.gz`.


In [38]:
kgtk("head -i $TEMP/statistics.Pinstance_count.tsv.gz")

Unnamed: 0,node1,label,node2,id
0,Q1000017,Pinstance_count,1,Q1000017-Pinstance_count-6b86b2
1,Q1000091,Pinstance_count,1,Q1000091-Pinstance_count-6b86b2
2,Q1000156,Pinstance_count,11,Q1000156-Pinstance_count-4fc82b
3,Q100023,Pinstance_count,1,Q100023-Pinstance_count-6b86b2
4,Q100026,Pinstance_count,1,Q100026-Pinstance_count-6b86b2
5,Q100029091,Pinstance_count,10,Q100029091-Pinstance_count-4a44dc
6,Q1000300,Pinstance_count,2,Q1000300-Pinstance_count-d4735e
7,Q100034524,Pinstance_count,3,Q100034524-Pinstance_count-4e0740
8,Q1000371,Pinstance_count,3,Q1000371-Pinstance_count-4e0740
9,Q100038174,Pinstance_count,11,Q100038174-Pinstance_count-4fc82b


In [39]:
!kgtk --debug query -i $TEMP/statistics.Pinstance_count.tsv.gz --idx mode:monograph --limit 5

[2021-12-28 23:40:30 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_12 AS graph_12_c1
     LIMIT ?
  PARAS: [5]
---------------------------------------------
node1	label	node2	id
Q1000017	Pinstance_count	1	Q1000017-Pinstance_count-6b86b2
Q1000091	Pinstance_count	1	Q1000091-Pinstance_count-6b86b2
Q1000156	Pinstance_count	11	Q1000156-Pinstance_count-4fc82b
Q100023	Pinstance_count	1	Q100023-Pinstance_count-6b86b2
Q100026	Pinstance_count	1	Q100026-Pinstance_count-6b86b2


## Compute the edge file that contains the graph we want to visualize for each class

The edge file contains `subclass / P279 / class` edges, but we add two columns to support the visualization:

- `graph`: the id of the class we want to visualize. This column allows us to quickly fetch all the edges needed to build the visualization of a class.
- `edge_type`: either `subclass` or `superclass`, so that the viewer can easily tell subclass edges from superclass edges in the visualization.

### Compute the subclass edges

For every class (the graph) we want to find all the P279 edges for subclasses of the given class. We use `class-browsing.low-degree-nodes.tsv` so that we don't include high degree classes that will blow up the browser.

In [40]:
kgtk(f"""
    query -i p279starlow -i p279 -i $OUT/class-browsing.low-degree-nodes.tsv
        --match '
            p279starlow: (subclass1)-[]->(class),
            p279starlow: (subclass2)-[]->(class),
            low: (subclass1),
            low: (subclass2),
            p279: (subclass1)-[]->(subclass2)'
        --return 'distinct subclass1 as node1, "P279" as label, subclass2 as node2, class as graph, "subclass" as edge_type'
    -o $TEMP/all.graph.low.sub.tsv.gz
""")

In [41]:
!zcat < $TEMP/all.graph.low.sub.tsv.gz | wc -l

 18213555


We have a lot of edges because we make copies for every graph, i.e., the same edge appears in many graphs. This is annoying, but it allows us to fetch the graphs very quickly, in less than 2 seconds.

In [42]:
kgtk("head -n 5 -i $TEMP/all.graph.low.sub.tsv.gz")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q100000030,P279,Q14748,Q14748,subclass
1,Q100000030,P279,Q14748,Q14745,subclass
2,Q100000030,P279,Q14748,Q1357761,subclass
3,Q100000030,P279,Q14748,Q2424752,subclass
4,Q100000030,P279,Q14748,Q31807746,subclass
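
The query above can be read as: a direct `P279` edge belongs to the graph of class `g` when both of its endpoints sit at or below `g` in the closure. The sketch below shows that join on a toy chain. It leaves out the low degree filter for brevity, and the function name is hypothetical.

```python
from collections import defaultdict


def subclass_edges_per_graph(closure, p279_edges):
    """Return (node1, node2, graph) triples for the subclass side of each graph.

    closure: (subclass, class) pairs including self links.
    p279_edges: direct (subclass, superclass) pairs.
    """
    graphs_of = defaultdict(set)  # node -> classes it is at or below
    for subclass, cls in closure:
        graphs_of[subclass].add(cls)

    rows = set()
    for s1, s2 in p279_edges:
        # The edge is copied into every graph that contains both endpoints.
        for graph in graphs_of.get(s1, set()) & graphs_of.get(s2, set()):
            rows.add((s1, s2, graph))
    return rows


p279_edges = [("Q3", "Q2"), ("Q2", "Q1")]
closure = [
    ("Q1", "Q1"),
    ("Q2", "Q2"), ("Q2", "Q1"),
    ("Q3", "Q3"), ("Q3", "Q2"), ("Q3", "Q1"),
]
sub_rows = subclass_edges_per_graph(closure, p279_edges)
print(sorted(sub_rows))
```

The edge `Q3 -> Q2` shows up once for graph `Q2` and once for graph `Q1`, which is the same duplication visible in the output above.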


### Compute the superclass edges

The superclass edges are also P279 edges, but they sit above the given class. We don't need to filter to low degree classes because we are going up the P279 hierarchy.

In [43]:
kgtk(f"""
    query -i p279stard -i p279
        --match '
            p279stard: (class)-[]->(superclass1),
            p279stard: (class)-[]->(superclass2),
            p279: (superclass1)-[]->(superclass2)'
        --return 'distinct superclass1 as node1, "P279" as label, superclass2 as node2, class as graph, "superclass" as edge_type'
    -o $TEMP/all.graph.low.super.tsv.gz
""")

In [44]:
!zcat < $TEMP/all.graph.low.super.tsv.gz | wc -l

 121028861


In [45]:
kgtk("head -n 5 -i $TEMP/all.graph.low.super.tsv.gz")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q95079834,P279,Q1000068,Q95079834,superclass
1,Q17372279,P279,Q100026,Q17372279,superclass
2,Q17372377,P279,Q100026,Q17372377,superclass
3,Q17372377,P279,Q100026,Q17372463,superclass
4,Q17372377,P279,Q100026,Q17372473,superclass
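
The superclass side is the mirror image: for a class `c`, keep the direct `P279` edges whose endpoints are both `c` itself or an ancestor of `c`. A toy sketch, with a hypothetical function name and a deliberately naive loop that would not scale to the real data:

```python
from collections import defaultdict


def superclass_edges_per_graph(closure, p279_edges):
    """Return (node1, node2, graph) triples for the superclass side of each graph.

    closure: (class, superclass) pairs including self links.
    p279_edges: direct (subclass, superclass) pairs.
    """
    ancestors = defaultdict(set)  # class -> itself and everything above it
    for cls, superclass in closure:
        ancestors[cls].add(superclass)

    rows = set()
    for cls, above in ancestors.items():
        for s1, s2 in p279_edges:
            if s1 in above and s2 in above:
                rows.add((s1, s2, cls))
    return rows


p279_edges = [("Q3", "Q2"), ("Q2", "Q1")]
closure = [
    ("Q1", "Q1"),
    ("Q2", "Q2"), ("Q2", "Q1"),
    ("Q3", "Q3"), ("Q3", "Q2"), ("Q3", "Q1"),
]
super_rows = superclass_edges_per_graph(closure, p279_edges)
print(sorted(super_rows))
```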


### Concatenate the subclass and superclass files, and store in `$TEMP/graph.low.tsv.gz`

We keep the file in `$TEMP` because for the final file we want to add the high degree nodes so that the user sees that they exist (we will not add their subclasses). Once we have the complete file, we will put it in `$OUT`.

In [46]:
kgtk(f"""
    cat --use-graph-cache-envar False -i $TEMP/all.graph.low.sub.tsv.gz -i $TEMP/all.graph.low.super.tsv.gz
    -o $TEMP/graph.low.tsv.gz
""")

Index the file to allow fast queries on all columns

In [47]:
!kgtk --debug query -i $TEMP/graph.low.tsv.gz --as graphbrowser --idx index:node1,node2,graph,edge_type --limit 3

[2021-12-29 00:04:25 sqlstore]: DROP graph data table graph_15 from graphbrowser
[2021-12-29 00:08:16 sqlstore]: IMPORT graph directly into table graph_31 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/temp.class-visualization/graph.low.tsv.gz ...
[2021-12-29 00:15:23 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_31 AS graph_31_c1
     LIMIT ?
  PARAS: [3]
---------------------------------------------
[2021-12-29 00:15:23 sqlstore]: CREATE INDEX "graph_31_node1_node2_graph_edge_type_idx" ON "graph_31" ("node1", "node2", "graph", "edge_type")
[2021-12-29 00:17:54 sqlstore]: ANALYZE "graph_31_node1_node2_graph_edge_type_idx"
node1	label	node2	graph	edge_type
Q100000030	P279	Q14748	Q14748	subclass
Q100000030	P279	Q14748	Q14745	subclass
Q100000030	P279	Q14748	Q1357761	subclass


## Compute the node file for visualization

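The node file for visualization needs the label of each node and the `graph` it belongs to, so that we can pull out all the nodes of a graph quickly. We also add:

- `instance_count`: the number of direct instances of the class, as it is interesting for the user to see this information.
- `node_type`: `few_subclasses` or `many_subclasses`, taken from `class-browsing.all-nodes.tsv`.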

### Extract the nodes from the edge file

The reason to use the edge file is that we need the `graph` id. We do it in two steps: first extract the `node1` nodes, then extract the `node2` nodes.

In [194]:
kgtk("""
    query -i label -i $TEMP/statistics.Pinstance_count.tsv.gz -i graphbrowser -i browsernodes
        --match '
            graphbrowser: (c)-[{graph: graph}]->(),
            browsernodes: (c)-[{node_type: nt}]->()'
        --opt 'label: (c)-[]->(class_label)'
        --opt 'Pinstance_count: (c)-[:Pinstance_count]->(instance_count)'
        --return 'distinct c as node1, graph as graph, coalesce(instance_count,0) as instance_count, nt as node_type, class_label as label'
    -o $TEMP/graph.low.node1.tsv.gz
""")


This is what our node file looks like:

In [195]:
kgtk("head -n 5 -i $TEMP/graph.low.node1.tsv.gz")

Unnamed: 0,node1,graph,instance_count,node_type,label
0,Q898273,Q103839965,11047,few_subclasses,'protein domain'@en
1,Q898273,Q103839987,11047,few_subclasses,'protein domain'@en
2,Q898273,Q103840002,11047,few_subclasses,'protein domain'@en
3,Q898273,Q103840059,11047,few_subclasses,'protein domain'@en
4,Q898273,Q103840066,11047,few_subclasses,'protein domain'@en


In [196]:
kgtk("""
    query -i label -i $TEMP/statistics.Pinstance_count.tsv.gz -i graphbrowser -i browsernodes
        --match '
            graphbrowser: ()-[{graph: graph}]->(c),
            browsernodes: (c)-[{node_type: nt}]->()'
        --opt 'label: (c)-[]->(class_label)'
        --opt 'Pinstance_count: (c)-[:Pinstance_count]->(instance_count)'
        --return 'distinct c as node1, graph as graph, coalesce(instance_count,0) as instance_count, nt as node_type, class_label as label'
    -o $TEMP/graph.low.node2.tsv.gz
""")

### Concatenate the two node files, deduplicate and index

To-do: try presorting the files to see if `compact` runs faster. As it is, this command takes over 2.5 hours.

In [197]:
kgtk("""
    cat --use-graph-cache-envar False --mode NONE -i $TEMP/graph.low.node1.tsv.gz -i $TEMP/graph.low.node2.tsv.gz
    / compact --mode NONE  --columns node1 graph
    -o $TEMP/graph.low.node.tsv.gz
""")

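We only need to index this file on `graph`, as we will not run node queries on it. The indexing happens later, once the high degree nodes have been added to the node file.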
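
For reference, the goal of the concatenate-and-deduplicate cell above is to end up with one row per (`node1`, `graph`) pair, since the same node can be extracted once as a `node1` and once as a `node2`. The sketch below is a plain first-wins deduplication on that key, with a hypothetical function name. It is a simplification and does not reproduce how `kgtk compact` treats rows whose other columns differ.

```python
def dedupe_on_key(rows, key_columns=("node1", "graph")):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The same node reached via node1 and via node2 of the same graph.
rows = [
    {"node1": "Q2", "graph": "Q1", "instance_count": "7"},
    {"node1": "Q2", "graph": "Q1", "instance_count": "7"},
    {"node1": "Q2", "graph": "Q3", "instance_count": "7"},
]
unique_rows = dedupe_on_key(rows)
print(unique_rows)
```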

## Special handling of high degree nodes

In [198]:
kgtk("head -n 5 -i $OUT/class-browsing.high-degree-nodes.tsv")

Unnamed: 0,node1
0,Q20747295
1,Q8054
2,Q7187
3,Q277338
4,Q427087


### Make a graph file with the `P279` edges where the subclass is a high degree class

Do this only to add edges that connect to the subclasses of our target node, so `class` has to be in `$TEMP/all.graph.low.sub.tsv.gz`.

In [199]:
kgtk("""
    query --debug -i $OUT/class-browsing.high-degree-nodes.tsv -i p279 -i $TEMP/all.graph.low.sub.tsv.gz
        --match '
            low: (class)-[{graph: graph}]->(),
            high: (subclass),
            p279: (subclass)-[]->(class)'
        --where 'subclass != class'
        --return 'distinct subclass as node1, "P279" as label, class as node2, graph as graph, "subclass" as edge_type'
    -o $TEMP/graph.high.tsv.gz
""")

[2021-12-30 22:55:16 sqlstore]: DROP graph data table graph_33 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/class-browsing.high-degree-nodes.tsv
[2021-12-30 22:55:16 sqlstore]: IMPORT graph directly into table graph_33 from /Users/pedroszekely/Downloads/kypher/projects/class-visualization/class-browsing.high-degree-nodes.tsv ...
[2021-12-30 22:55:16 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_33_c2."node1" "_aLias.node1", ? "_aLias.label", graph_40_c1."node1" "_aLias.node2", graph_40_c1."graph" "_aLias.graph", ? "_aLias.edge_type"
     FROM graph_1 AS graph_1_c3
     INNER JOIN graph_33 AS graph_33_c2, graph_40 AS graph_40_c1
     ON graph_33_c2."node1" = graph_1_c3."node1"
        AND graph_40_c1."node1" = graph_1_c3."node2"
        AND graph_40_c1."graph" = graph_40_c1."graph"
        AND (graph_33_c2."node1" != graph_40_c1."node1")
  PARAS: ['P279', 'subclass']
---------------------------------------------


In [200]:
kgtk("head -n 5 -i $TEMP/graph.high.tsv.gz")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q10267817,P279,Q18553442,Q1225194,subclass
1,Q107715,P279,Q309314,Q246672,subclass
2,Q107715,P279,Q309314,Q937228,subclass
3,Q107715,P279,Q309314,Q7184903,subclass
4,Q107715,P279,Q309314,Q35120,subclass


### Make a node file with the high degree nodes

We use the edge file because we need to put the `graph` in the node file too.

In [201]:
kgtk("""
    query -i label -i $TEMP/statistics.Pinstance_count.tsv.gz -i $TEMP/graph.high.tsv.gz
        --match 'high: (c)-[{graph: graph}]->()'
        --opt 'label: (c)-[]->(class_label)'
        --opt 'Pinstance_count: (c)-[:Pinstance_count]->(instance_count)'
        --return 'distinct c as node1, graph as graph, coalesce(instance_count,0) as instance_count, "many_subclasses" as node_type, class_label as label'
    -o $TEMP/graph.high.node.tsv.gz
""")

In [202]:
kgtk("head -n 5 -i $TEMP/graph.high.node.tsv.gz")

Unnamed: 0,node1,graph,instance_count,node_type,label
0,Q10267817,Q1225194,1,many_subclasses,'autosomal recessive disease'@en
1,Q107715,Q246672,93,many_subclasses,'physical quantity'@en
2,Q107715,Q937228,93,many_subclasses,'physical quantity'@en
3,Q107715,Q7184903,93,many_subclasses,'physical quantity'@en
4,Q107715,Q35120,93,many_subclasses,'physical quantity'@en


Just to make sure, count the number of subclasses of one of our supposedly high degree nodes. It looks innocent with only one instance, but it does indeed have many subclasses.

In [203]:
kgtk("query -i p279 --match '(subclass)-[]->(:Q10267817)' --return 'count(distinct subclass)'")

Unnamed: 0,"count(DISTINCT graph_1_c1.""node1"")"
0,1097


In [204]:
kgtk("query -i p279 --match '(subclass)-[]->(:Q30185)' --return 'count(distinct subclass)'")

Unnamed: 0,"count(DISTINCT graph_1_c1.""node1"")"
0,2350


### Augment the low degree edge and node files with the high degree info

Concatenating without deduplication is sufficient as the files cannot have duplicate edges or nodes.

In [225]:
kgtk("""
    cat --use-graph-cache-envar False -i $TEMP/graph.high.tsv.gz -i $TEMP/graph.low.tsv.gz
    -o $OUT/class-visualization.edge.tsv.gz
""")

In [206]:
kgtk("head -n 5 -i $OUT/class-visualization.edge.tsv.gz")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q10267817,P279,Q18553442,Q1225194,subclass
1,Q107715,P279,Q309314,Q246672,subclass
2,Q107715,P279,Q309314,Q937228,subclass
3,Q107715,P279,Q309314,Q7184903,subclass
4,Q107715,P279,Q309314,Q35120,subclass


Index the file for query using the `graph` column:

In [207]:
!kgtk query -i $OUT/class-visualization.edge.tsv.gz --as classvizedge --idx index:graph --limit 3

node1	label	node2	graph	edge_type
Q10267817	P279	Q18553442	Q1225194	subclass
Q107715	P279	Q309314	Q246672	subclass
Q107715	P279	Q309314	Q937228	subclass
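
Indexing on `graph` is what makes fetching one visualization cheap: all rows for a class can be looked up directly instead of scanning the whole file. An in-memory analogue, using the three rows printed above and a hypothetical function name, looks like this:

```python
from collections import defaultdict


def index_by_graph(edge_rows):
    """Group (node1, label, node2, graph, edge_type) rows by their graph column."""
    by_graph = defaultdict(list)
    for node1, label, node2, graph, edge_type in edge_rows:
        by_graph[graph].append((node1, label, node2, edge_type))
    return by_graph


edge_rows = [
    ("Q10267817", "P279", "Q18553442", "Q1225194", "subclass"),
    ("Q107715", "P279", "Q309314", "Q246672", "subclass"),
    ("Q107715", "P279", "Q309314", "Q937228", "subclass"),
]
edge_index = index_by_graph(edge_rows)
print(edge_index["Q246672"])
```

The SQLite index created by `--idx index:graph` plays the same role on disk as the dictionary does here.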


Concatenate the node files:

In [226]:
kgtk("""
    cat --use-graph-cache-envar False --mode NONE -i $TEMP/graph.high.node.tsv.gz -i $TEMP/graph.low.node.tsv.gz
    -o $TEMP/class-visualization.node.tsv.gz
""")

Add a tooltip with meaningful information

In [228]:
kgtk("""
    query -i $TEMP/class-visualization.node.tsv.gz
        --match '(node)-[{graph: g, instance_count: ic, node_type: nt, label: l}]->()'
        --return 'distinct
            node as node1, g as graph, ic as instance_count, nt as node_type, l as label,
            printf("%s (%s)<BR/>instance count: %s<BR/>node type: %s", kgtk_lqstring_text(l), node, cast(ic, int), nt) as tooltip'
    -o $OUT/class-visualization.node.tsv.gz
""")

In [229]:
kgtk("head -n 5 -i $OUT/class-visualization.node.tsv.gz")

Unnamed: 0,node1,graph,instance_count,node_type,label,tooltip
0,Q10267817,Q1225194,1,many_subclasses,'autosomal recessive disease'@en,autosomal recessive disease (Q10267817)<BR/>in...
1,Q107715,Q246672,93,many_subclasses,'physical quantity'@en,physical quantity (Q107715)<BR/>instance count...
2,Q107715,Q937228,93,many_subclasses,'physical quantity'@en,physical quantity (Q107715)<BR/>instance count...
3,Q107715,Q7184903,93,many_subclasses,'physical quantity'@en,physical quantity (Q107715)<BR/>instance count...
4,Q107715,Q35120,93,many_subclasses,'physical quantity'@en,physical quantity (Q107715)<BR/>instance count...
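
The tooltip string built by the `printf` expression above can be reproduced in plain Python to check the format outside of Kypher. This is a sketch under stated assumptions: `lqstring_text` is a simplified, hypothetical stand-in for `kgtk_lqstring_text` (it ignores escaped quotes), and `make_tooltip` is a name introduced only for this example.

```python
def lqstring_text(value):
    """Return the text part of a language-qualified string such as 'machine'@en.

    Simplified stand-in for kgtk_lqstring_text; escaped quotes are not handled.
    """
    if "@" in value:
        value = value.rsplit("@", 1)[0]  # drop the language tag
    return value.strip("'")              # drop the surrounding quotes


def make_tooltip(node, label, instance_count, node_type):
    """Format the tooltip with the same template as the query."""
    return "%s (%s)<BR/>instance count: %s<BR/>node type: %s" % (
        lqstring_text(label), node, int(instance_count), node_type)


tooltip = make_tooltip("Q107715", "'physical quantity'@en", "93", "many_subclasses")
print(tooltip)
```

For the `Q107715` row, the result matches the tooltip the query writes for that node.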


Index the file for query using the `graph` column:

In [230]:
!kgtk query -i $OUT/class-visualization.node.tsv.gz --as classviznode --idx index:graph --limit 3

node1	graph	instance_count	node_type	label	tooltip
Q10267817	Q1225194	1	many_subclasses	'autosomal recessive disease'@en	autosomal recessive disease (Q10267817)<BR/>instance count: 1<BR/>node type: many_subclasses
Q107715	Q246672	93	many_subclasses	'physical quantity'@en	physical quantity (Q107715)<BR/>instance count: 93<BR/>node type: many_subclasses
Q107715	Q937228	93	many_subclasses	'physical quantity'@en	physical quantity (Q107715)<BR/>instance count: 93<BR/>node type: many_subclasses


Temporary: we need this file because the current version of `visualize` requires labels in the edge file. The new version can read the labels from the node file.

Test creation of the node file:

In [236]:
# root = "Q11424"
# root="Q391342"
root = "Q1420"
# root="Q1107"
# root="Q889821"
# root="Q1549591"
# root="Q188724"
# root="Q946808"
kgtk(f"""
    query -i classviznode
        --match '(class)-[{{graph: "{root}", instance_count: instance_count, label: label}}]->()'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label,tooltip
0,Q11019,Q1420,99,few_subclasses,'machine'@en,machine (Q11019)<BR/>instance count: 99<BR/>no...
1,Q1183543,Q1420,198,many_subclasses,'device'@en,device (Q1183543)<BR/>instance count: 198<BR/>...
2,Q1301433,Q1420,17,few_subclasses,'land vehicle'@en,land vehicle (Q1301433)<BR/>instance count: 17...
3,Q1420,Q1420,862,many_subclasses,'motor car'@en,motor car (Q1420)<BR/>instance count: 862<BR/>...
4,Q15401930,Q1420,12,few_subclasses,'product'@en,product (Q15401930)<BR/>instance count: 12<BR/...
5,Q15618781,Q1420,29,few_subclasses,'wheeled vehicle'@en,wheeled vehicle (Q15618781)<BR/>instance count...
6,Q16686448,Q1420,24,few_subclasses,'artificial entity'@en,artificial entity (Q16686448)<BR/>instance cou...
7,Q16798631,Q1420,389,few_subclasses,'equipment'@en,equipment (Q16798631)<BR/>instance count: 389<...
8,Q223557,Q1420,110,few_subclasses,'physical object'@en,physical object (Q223557)<BR/>instance count: ...
9,Q2424752,Q1420,412,few_subclasses,'product'@en,product (Q2424752)<BR/>instance count: 412<BR/...


## Test creation of visualizations

In [234]:
roots = [
    "Q11424",
    "Q391342",
    "Q1420",
    "Q1107",
    "Q889821",
    "Q1549591",
    "Q188724",
    "Q946808",
    "Q33999",
    "Q483501",
    "Q2221906",
    "Q144",
    "Q516021",
    "Q10494269"
]

for root in roots:
    kgtk(f"""
        query -i classvizedgetest
            --match '(class)-[{{label: property, graph: "{root}", edge_type: edge_type}}]->(superclass)'
        -o $TEMP/browser/{root}.graph.low.tsv
    """)

    kgtk(f"""
        query -i classviznode
            --match '(class)-[{{graph: "{root}", instance_count: instance_count, label: label}}]->()'
        -o $TEMP/browser/{root}.node.graph.low.tsv
    """)

    # kgtk(f"""
    #     visualize-force-graph -i $TEMP/browser/{root}.graph.low.tsv
    #         --direction arrow
    #         -o $TEMP/browser/{root}.graph.low.html
    # """)
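
The loop above runs one query per root, each selecting the rows whose `graph` value equals that root. The same selection can be sketched in plain Python as a single pass that groups rows by `graph`. This is an illustration only: `split_by_graph` and the sample rows are assumptions, and the notebook itself relies on the indexed Kypher queries.

```python
from collections import defaultdict


def split_by_graph(rows, roots):
    """Group edge rows by their graph column, keeping only the requested roots."""
    wanted = set(roots)
    per_root = defaultdict(list)
    for row in rows:
        graph = row[3]  # columns: node1, label, node2, graph, edge_type
        if graph in wanted:
            per_root[graph].append(row)
    return per_root


edges = [
    ("Q11019", "P279", "Q1183543", "Q1420", "superclass"),
    ("Q1420", "P279", "Q752870", "Q1420", "superclass"),
    ("Q107715", "P279", "Q309314", "Q246672", "subclass"),
]

per_root = split_by_graph(edges, ["Q1420", "Q11424"])
print({root: len(rows) for root, rows in per_root.items()})
```

Roots with no matching rows, such as `Q11424` in this toy input, simply do not appear in the result.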

## Tests for individual files

In [224]:
kgtk("""
    query -i $TEMP/graph.low.node.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label
0,Q11019,Q1420,99,few_subclasses,'machine'@en
1,Q1183543,Q1420,198,many_subclasses,'device'@en
2,Q1301433,Q1420,17,few_subclasses,'land vehicle'@en
3,Q1420,Q1420,862,many_subclasses,'motor car'@en
4,Q15401930,Q1420,12,few_subclasses,'product'@en
5,Q15618781,Q1420,29,few_subclasses,'wheeled vehicle'@en
6,Q16686448,Q1420,24,few_subclasses,'artificial entity'@en
7,Q16798631,Q1420,389,few_subclasses,'equipment'@en
8,Q223557,Q1420,110,few_subclasses,'physical object'@en
9,Q2424752,Q1420,412,few_subclasses,'product'@en


In [223]:
kgtk("""
    query -i $TEMP/graph.high.node.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label


In [227]:
kgtk("""
    query -i $TEMP/class-visualization.node.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label
0,Q11019,Q1420,99,few_subclasses,'machine'@en
1,Q1183543,Q1420,198,many_subclasses,'device'@en
2,Q1301433,Q1420,17,few_subclasses,'land vehicle'@en
3,Q1420,Q1420,862,many_subclasses,'motor car'@en
4,Q15401930,Q1420,12,few_subclasses,'product'@en
5,Q15618781,Q1420,29,few_subclasses,'wheeled vehicle'@en
6,Q16686448,Q1420,24,few_subclasses,'artificial entity'@en
7,Q16798631,Q1420,389,few_subclasses,'equipment'@en
8,Q223557,Q1420,110,few_subclasses,'physical object'@en
9,Q2424752,Q1420,412,few_subclasses,'product'@en


In [232]:
kgtk("""
    query -i classviznode
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label,tooltip
0,Q11019,Q1420,99,few_subclasses,'machine'@en,machine (Q11019)<BR/>instance count: 99<BR/>no...
1,Q1183543,Q1420,198,many_subclasses,'device'@en,device (Q1183543)<BR/>instance count: 198<BR/>...
2,Q1301433,Q1420,17,few_subclasses,'land vehicle'@en,land vehicle (Q1301433)<BR/>instance count: 17...
3,Q1420,Q1420,862,many_subclasses,'motor car'@en,motor car (Q1420)<BR/>instance count: 862<BR/>...
4,Q15401930,Q1420,12,few_subclasses,'product'@en,product (Q15401930)<BR/>instance count: 12<BR/...
5,Q15618781,Q1420,29,few_subclasses,'wheeled vehicle'@en,wheeled vehicle (Q15618781)<BR/>instance count...
6,Q16686448,Q1420,24,few_subclasses,'artificial entity'@en,artificial entity (Q16686448)<BR/>instance cou...
7,Q16798631,Q1420,389,few_subclasses,'equipment'@en,equipment (Q16798631)<BR/>instance count: 389<...
8,Q223557,Q1420,110,few_subclasses,'physical object'@en,physical object (Q223557)<BR/>instance count: ...
9,Q2424752,Q1420,412,few_subclasses,'product'@en,product (Q2424752)<BR/>instance count: 412<BR/...


In [214]:
kgtk("""
    query -i graphbrowser
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q11019,P279,Q1183543,Q1420,superclass
1,Q11019,P279,Q39546,Q1420,superclass
2,Q1183543,P279,Q16686448,Q1420,superclass
3,Q1183543,P279,Q16798631,Q1420,superclass
4,Q1183543,P279,Q39546,Q1420,superclass
5,Q1301433,P279,Q42889,Q1420,superclass
6,Q1420,P279,Q752870,Q1420,superclass
7,Q15401930,P279,Q488383,Q1420,superclass
8,Q15618781,P279,Q1301433,Q1420,superclass
9,Q16686448,P279,Q35120,Q1420,superclass


In [215]:
kgtk("""
    query -i $TEMP/graph.high.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,label,node2,graph,edge_type


In [216]:
kgtk("""
    query -i $TEMP/graph.low.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q11019,P279,Q1183543,Q1420,superclass
1,Q11019,P279,Q39546,Q1420,superclass
2,Q1183543,P279,Q16686448,Q1420,superclass
3,Q1183543,P279,Q16798631,Q1420,superclass
4,Q1183543,P279,Q39546,Q1420,superclass
5,Q1301433,P279,Q42889,Q1420,superclass
6,Q1420,P279,Q752870,Q1420,superclass
7,Q15401930,P279,Q488383,Q1420,superclass
8,Q15618781,P279,Q1301433,Q1420,superclass
9,Q16686448,P279,Q35120,Q1420,superclass


In [217]:
kgtk("""
    query -i $TEMP/all.graph.low.sub.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,label,node2,graph,edge_type


In [218]:
kgtk("""
    query -i $TEMP/all.graph.low.super.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,label,node2,graph,edge_type
0,Q11019,P279,Q1183543,Q1420,superclass
1,Q11019,P279,Q39546,Q1420,superclass
2,Q1183543,P279,Q16686448,Q1420,superclass
3,Q1183543,P279,Q16798631,Q1420,superclass
4,Q1183543,P279,Q39546,Q1420,superclass
5,Q1301433,P279,Q42889,Q1420,superclass
6,Q1420,P279,Q752870,Q1420,superclass
7,Q15401930,P279,Q488383,Q1420,superclass
8,Q15618781,P279,Q1301433,Q1420,superclass
9,Q16686448,P279,Q35120,Q1420,superclass


In [219]:
kgtk("""
    query -i $TEMP/graph.low.node.tsv.gz
        --match '(node)-[{graph: "Q1420"}]->()'
        --order-by 'node'
""")

Unnamed: 0,node1,graph,instance_count,node_type,label
0,Q11019,Q1420,99,few_subclasses,'machine'@en
1,Q1183543,Q1420,198,many_subclasses,'device'@en
2,Q1301433,Q1420,17,few_subclasses,'land vehicle'@en
3,Q1420,Q1420,862,many_subclasses,'motor car'@en
4,Q15401930,Q1420,12,few_subclasses,'product'@en
5,Q15618781,Q1420,29,few_subclasses,'wheeled vehicle'@en
6,Q16686448,Q1420,24,few_subclasses,'artificial entity'@en
7,Q16798631,Q1420,389,few_subclasses,'equipment'@en
8,Q223557,Q1420,110,few_subclasses,'physical object'@en
9,Q2424752,Q1420,412,few_subclasses,'product'@en


### In progress: Trim the subclasses based on the levels

The idea is to also trim the graph based on the number of levels. This may be difficult: some small graphs may have many levels, and some graphs may become large with only a few levels.

This is our starting point:

In [70]:
kgtk("head -i $OUT/derived.p279star.complete.tsv.gz -n 5")

Unnamed: 0,node1,label,node2,distance,id
0,Q100000030,P279star,Q100000030,0,Q100000030-P279star-Q100000030
1,Q100000030,P279star,Q14748,1,Q100000030-P279star-Q14748
2,Q100000030,P279star,Q14745,2,Q100000030-P279star-Q14745
3,Q100000030,P279star,Q1357761,3,Q100000030-P279star-Q1357761
4,Q100000030,P279star,Q2424752,3,Q100000030-P279star-Q2424752


Let's look at the distribution of distances:

In [71]:
kgtk("""
    query -i p279starcomplete
        --match '(class)-[eid {distance: d}]->(superclass)'
        --return 'distinct d as distance, count(eid) as count'
        --order-by 'cast(count, int) desc'
""")

Unnamed: 0,distance,count
0,6,14920344
1,4,12395081
2,5,12068280
3,3,11432165
4,7,8660425
5,2,6976960
6,8,6681393
7,9,4448827
8,1,3077658
9,0,2503943
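
The aggregation above is a group-by on `distance` followed by a sort on the counts. A minimal sketch of the same computation in plain Python, using only the five rows shown in the `head` output earlier (so the counts are illustrative, not the real distribution):

```python
from collections import Counter

# (node1, node2, distance) taken from the five sample rows shown above.
rows = [
    ("Q100000030", "Q100000030", 0),
    ("Q100000030", "Q14748", 1),
    ("Q100000030", "Q14745", 2),
    ("Q100000030", "Q1357761", 3),
    ("Q100000030", "Q2424752", 3),
]

# Count how many edges there are at each distance.
distance_counts = Counter(distance for _, _, distance in rows)

# Sort by count, largest first, like the --order-by clause.
ranked = sorted(distance_counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```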


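For each class, count the distinct subclasses within a bounded distance (the intent is distance < K=10). Note that the query as written reads the `p279stard` alias rather than `p279starcomplete` and filters with `d < 9`, while the output file name uses `d10`: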
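
The counting step performed by the query in the next cell can be sketched in plain Python: for every class, collect the distinct proper subclasses whose distance is below a threshold, then count them. The function name `count_subclasses` and the toy class IDs `A`, `B`, `C` are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict


def count_subclasses(rows, max_distance):
    """Count distinct proper subclasses per class with distance < max_distance."""
    subclasses = defaultdict(set)
    for subclass, superclass, distance in rows:
        # Skip the reflexive distance-0 rows and anything too far away.
        if subclass != superclass and distance < max_distance:
            subclasses[superclass].add(subclass)
    return {cls: len(members) for cls, members in subclasses.items()}


# Toy transitive closure for the chain C -> B -> A (hypothetical IDs).
rows = [
    ("A", "A", 0),
    ("B", "A", 1),
    ("C", "B", 1),
    ("C", "A", 2),
]

near = count_subclasses(rows, max_distance=2)
far = count_subclasses(rows, max_distance=3)
print(near, far)
```

Raising the threshold from 2 to 3 lets `C` count as a subclass of `A`, which shows how the cutoff changes the counts.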

In [72]:
kgtk("""
    query -i p279stard
        --match '(subclass)-[eid {distance: d}]->(class)'
        --return 'class as node1, "Pcount_subclasses" as label, count(distinct subclass) as node2'
        --where 'subclass != class and d < 9'
        --order-by 'cast(node2, int) desc'
    -o $TEMP/subclass.count.d10.tsv.gz
""")

`kgtk add-labels` takes a very long time, so we label only the first 20 lines:

In [73]:
!zcat < $TEMP/subclass.count.d10.tsv.gz | head -20 | kgtk add-labels / table

zcat: error writing to output: Broken pipe
| node1     | label             | node2   | node1;label                 |
| --------- | ----------------- | ------- | --------------------------- |
| Q35120    | Pcount_subclasses | 2366995 | 'entity'@en                 |
| Q99527517 | Pcount_subclasses | 1440970 | 'collection entity'@en      |
| Q16887380 | Pcount_subclasses | 1326944 | 'group'@en                  |
| Q20937557 | Pcount_subclasses | 1255680 | 'series'@en                 |
| Q28813620 | Pcount_subclasses | 1226806 | 'set'@en                    |
| Q488383   | Pcount_subclasses | 1185270 | 'object'@en                 |
| Q4406616  | Pcount_subclasses | 1144700 | 'concrete object'@en        |
| Q223557   | Pcount_subclasses | 1136457 | 'physical object'@en        |
| Q6671777  | Pcount_subclasses | 1110651 | 'structure'@en              |
| Q58415929 | Pcount_subclasses | 1091001 | 'spatio-temporal entity'@en |
| Q219858   | Pcount_subclasses | 1056942 | 'zone'@en                