# CSKG use case

In this notebook, we will *compute statistics and embeddings over a 0.1% random sample (`cskg_sample.tsv`) of our Commonsense Knowledge Graph (CSKG).* 

This sample contains 17,324 edges.

**Note on the expected running time:** Running this notebook takes around 10 minutes on a MacBook Pro laptop with macOS Catalina 10.15, a 2.3 GHz 8-core Intel Core i9 processor, a 2 TB SSD, and 64 GB of 2667 MHz DDR4 memory.

## Computing statistics over the graph

Let's compute graph statistics: degrees, PageRank and HITS centrality, and other general graph descriptors.

In [13]:
%%bash
kgtk graph_statistics --directed --degrees --pagerank --hits --log cskg_summary.txt ../sample_data/cskg/cskg_sample.tsv > cskg_stats.tsv

### Inspecting the output

This command computes the in-degree, out-degree, PageRank, and HITS hub and authority scores of every node and writes them to `cskg_stats.tsv`, one row per node and statistic. Here is the tail of the file:

In [14]:
%%bash
tail cskg_stats.tsv

wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533	vertex_in_degree	0	wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_in_degree-128480
wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533	vertex_out_degree	1	wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_out_degree-128481
wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533	vertex_pagerank	2.5493151245983973e-05	wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_pagerank-128482
wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533	vertex_hubs	0.0	wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_hubs-128483
wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533	vertex_auth	2.9904158236518527e-249	wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_auth-128484
/c/en/marrow/n/wn/plant+wn:marrow.n.02	vertex_in_degree	1	/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_in_degree-128485
/c/en/marrow/n/wn/plant+wn:marrow.n.02	vertex_out_degree	0	/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_out_degree-128
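Because the file stores one row per node and statistic, it is often handier to pivot it into one record per node before further analysis. The following is a minimal Python sketch of that step, not part of KGTK: the function name `pivot_stats` is hypothetical, and it assumes the four tab-separated columns visible in the output above (node, statistic, value, row id). Rows whose value is not numeric, such as a possible header line, are skipped.

```python
from collections import defaultdict

NODE = "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533"

# Three rows copied from the tail output above: node, statistic, value, row id.
SAMPLE_ROWS = [
    f"{NODE}\tvertex_in_degree\t0\t{NODE}-vertex_in_degree-128480",
    f"{NODE}\tvertex_out_degree\t1\t{NODE}-vertex_out_degree-128481",
    f"{NODE}\tvertex_pagerank\t2.5493151245983973e-05\t{NODE}-vertex_pagerank-128482",
]


def pivot_stats(lines):
    """Group long-format rows into {node: {statistic: value}} (hypothetical helper)."""
    stats = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or truncated rows
        node, statistic, value = fields[:3]
        try:
            number = float(value)
        except ValueError:
            continue  # skip a header row or any other non-numeric value
        stats[node][statistic] = number
    return dict(stats)


node_stats = pivot_stats(SAMPLE_ROWS)
print(node_stats[NODE])
```

To run this on the real output, pass an open file handle instead of the sample rows, for example `with open("cskg_stats.tsv") as f: node_stats = pivot_stats(f)`.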

It has also generated an aggregated summary of these and other graph statistics in `cskg_summary.txt`. Let's print the contents of this file:

In [15]:
%%bash
cat cskg_summary.txt

loading the TSV graph now ...
graph loaded! It has 25698 nodes and 17324 edges

###Top relations:
/r/RelatedTo	4552
rdfs:subClassOf	3154
vg:InImage	2996
/r/Synonym	1499
mw:PartOfSpeech	818
mw:POSForm	444
/r/Antonym	439
mw:IsPOSFormOf	438
/r/FormOf	340
/r/DerivedFrom	330

###Degrees:
in degree stats: mean=0.674138, std=0.060724, max=1
out degree stats: mean=0.674138, std=0.010146, max=1
total degree stats: mean=1.348276, std=0.061462, max=1

###PageRank
Max pageranks
29	mw:Verb	0.002626
33	mw:Noun	0.012193
22231	wd:Q8054+/c/en/polypeptide/n/wn/substance+wn:polypeptide.n.01	0.019311
21928	wd:Q20747295	0.022323
22314	wd:Q7187	0.010448

###HITS
HITS hubs
15713	vg:I2371025	0.000000
22314	wd:Q7187	0.000000
33	mw:Noun	0.000000
22231	wd:Q8054+/c/en/polypeptide/n/wn/substance+wn:polypeptide.n.01	0.000007
21928	wd:Q20747295	1.000000
HITS auth
24876	wd:Q62473855	0.031174
24875	wd:Q62469413	0.031174
24880	wd:Q62525750	0.031174
24879	wd:Q62525621	0.031174
21947	wd:Q15322928	0.031174
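As a quick sanity check on the degree section of the summary, the reported means can be reproduced from the node and edge counts in the first lines of the log. In a directed graph every edge contributes exactly one unit of out-degree and one unit of in-degree, so both means equal the number of edges divided by the number of nodes. The variable names below are illustrative only.

```python
# Counts copied from the summary log above.
num_nodes = 25698
num_edges = 17324

# Each directed edge adds one out-degree and one in-degree,
# so the mean in-degree and mean out-degree are both |E| / |V|.
mean_in_degree = num_edges / num_nodes
mean_out_degree = num_edges / num_nodes
mean_total_degree = mean_in_degree + mean_out_degree

print(f"mean in/out degree: {mean_in_degree:.6f}")
print(f"mean total degree:  {mean_total_degree:.6f}")
```

The printed values agree with the `mean=0.674138` and `mean=1.348276` entries in the log.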


## Computing embeddings

Another common operation is computing BERT-large embeddings over CSKG knowledge. Here is how:

**Note**: This may take a significant amount of time (10-15 min)

In [None]:
%%bash
kgtk text_embedding ../sample_data/cskg/cskg_sample.tsv \
 --debug --embedding-projector-metadata-path none \
 --label-properties "/r/Synonym" \
 --isa-properties "/r/IsA" \
 --description-properties "/r/DefinedAs" \
 --property-value "/r/Causes" "/r/UsedFor" \
 --has-properties "" \
 -f kgtk_format \
 --output-data-format kgtk_format \
 --use-cache \
 --model bert-large-nli-cls-token \
 > cskg_embeddings.txt

You can now inspect the embeddings in `cskg_embeddings.txt`.
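One simple way to inspect embeddings is to compare two nodes by cosine similarity. The sketch below is illustrative only and rests on an assumption about the file layout that you should verify against your own output: each vector is taken to sit on a tab-separated row of the form node, property, value, where the property is `text_embedding` and the value is a comma-separated list of floats. The helper names and the toy rows are hypothetical, not real CSKG embeddings.

```python
import math

# Hypothetical rows in the ASSUMED layout: node, property, comma-separated vector.
TOY_ROWS = [
    "node_a\ttext_embedding\t1.0,0.0,1.0",
    "node_b\ttext_embedding\t1.0,0.0,0.0",
    "node_a\tembedding_sentence\tsome text",  # non-vector row, ignored below
]


def load_embeddings(lines, vector_label="text_embedding"):
    """Return {node: vector} for rows whose property equals vector_label."""
    vectors = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[1] != vector_label:
            continue  # skip headers, sentences, and malformed rows
        vectors[fields[0]] = [float(x) for x in fields[2].split(",")]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = load_embeddings(TOY_ROWS)
similarity = cosine_similarity(vectors["node_a"], vectors["node_b"])
print(f"cosine similarity: {similarity:.4f}")
```

To try it on real data, pass an open handle to the embeddings file in place of `TOY_ROWS`, and adjust `vector_label` if your rows use a different property name.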