{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CSKG use case\n", "\n", "In this notebook, we will *compute statistics and embeddings over a 0.1% random sample (`cskg_sample.tsv`) of our Commonsense Knowledge Graph (CSKG).* \n", "\n", "This sample contains 17,234 edges.\n", "\n", "**Note on the expected running time:** Running this notebook takes around 10 minutes on a Macbook Pro laptop with MacOS Catalina 10.15, a 2.3 GHz 8-Core Intel Core i9 processor, 2TB SSD disk, and 64 GB 2667 MHz DDR4 memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing statistics over the graph\n", "\n", "Let's compute graph statistics: degrees, PageRank and HITS centrality, and other general graph descriptors." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk graph_statistics --directed --degrees --pagerank --hits --log cskg_summary.txt -i sample_data/cskg/cskg_sample.tsv > cskg_stats.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspecting the output\n", "\n", "This command has computed individual degree numbers, HITS hubs and authority values, and PageRank for all nodes in `cskg_stats.tsv`. Here are the last 10 lines of the file:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533\tvertex_in_degree\t0\twn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_in_degree-128480\n", "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533\tvertex_out_degree\t1\twn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_out_degree-128481\n", "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533\tvertex_pagerank\t2.5493151245983973e-05\twn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_pagerank-128482\n", "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533\tvertex_hubs\t0.0\twn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_hubs-128483\n", "wn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533\tvertex_auth\t2.9904158236518527e-249\twn:zucchini.n.01+/c/en/zucchini/n/wn/plant+wd:Q7533-vertex_auth-128484\n", "/c/en/marrow/n/wn/plant+wn:marrow.n.02\tvertex_in_degree\t1\t/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_in_degree-128485\n", "/c/en/marrow/n/wn/plant+wn:marrow.n.02\tvertex_out_degree\t0\t/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_out_degree-128486\n", "/c/en/marrow/n/wn/plant+wn:marrow.n.02\tvertex_pagerank\t4.7162310305776544e-05\t/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_pagerank-128487\n", "/c/en/marrow/n/wn/plant+wn:marrow.n.02\tvertex_hubs\t9.592664809864673e-248\t/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_hubs-128488\n", "/c/en/marrow/n/wn/plant+wn:marrow.n.02\tvertex_auth\t0.0\t/c/en/marrow/n/wn/plant+wn:marrow.n.02-vertex_auth-128489\n" ] } ], "source": [ "%%bash\n", "tail cskg_stats.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It has also generated an aggregated summary of these and other graph statistics in `cskg_summary.txt`. Let's print the contents of this file:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loading the TSV graph now ...\n", "graph loaded! It has 25698 nodes and 17324 edges\n", "\n", "###Top relations:\n", "/r/RelatedTo\t4552\n", "rdfs:subClassOf\t3154\n", "vg:InImage\t2996\n", "/r/Synonym\t1499\n", "mw:PartOfSpeech\t818\n", "mw:POSForm\t444\n", "/r/Antonym\t439\n", "mw:IsPOSFormOf\t438\n", "/r/FormOf\t340\n", "/r/DerivedFrom\t330\n", "\n", "###Degrees:\n", "in degree stats: mean=0.674138, std=0.060724, max=1\n", "out degree stats: mean=0.674138, std=0.010146, max=1\n", "total degree stats: mean=1.348276, std=0.061462, max=1\n", "\n", "###PageRank\n", "Max pageranks\n", "29\tmw:Verb\t0.002626\n", "33\tmw:Noun\t0.012193\n", "22231\twd:Q8054+/c/en/polypeptide/n/wn/substance+wn:polypeptide.n.01\t0.019311\n", "21928\twd:Q20747295\t0.022323\n", "22314\twd:Q7187\t0.010448\n", "\n", "###HITS\n", "HITS hubs\n", "15713\tvg:I2371025\t0.000000\n", "22314\twd:Q7187\t0.000000\n", "33\tmw:Noun\t0.000000\n", "22231\twd:Q8054+/c/en/polypeptide/n/wn/substance+wn:polypeptide.n.01\t0.000007\n", "21928\twd:Q20747295\t1.000000\n", "HITS auth\n", "24876\twd:Q62473855\t0.031174\n", "24875\twd:Q62469413\t0.031174\n", "24880\twd:Q62525750\t0.031174\n", "24879\twd:Q62525621\t0.031174\n", "21947\twd:Q15322928\t0.031174\n" ] } ], "source": [ "%%bash\n", "cat cskg_summary.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common operation is computing BERT-large embeddings over CSKG knowledge. Here is how:\n", "\n", "**Note**: This may take a significant amount of time (10-15 min)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "kgtk text_embedding -i sample_data/cskg/cskg_sample.tsv \\\n", " --debug --embedding-projector-metadata-path none \\\n", " --embedding-projector-metadata-path none \\\n", " --label-properties \"/r/Synonym\" \\\n", " --isa-properties \"/r/IsA\" \\\n", " --description-properties \"/r/DefinedAs\" \\\n", " --property-value \"/r/Causes\" \"/r/UsedFor\" \\\n", " --has-properties \"\" \\\n", " -f kgtk_format \\\n", " --output-data-format kgtk_format \\\n", " --use-cache \\\n", " --model bert-large-nli-cls-token \\\n", " > cskg_embedings.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now inspect the embeddings in `cskg_embeddings.txt`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }