{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing CSKG\n", "\n", "This notebook performs various analyses on CSKG.\n", "\n", "Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\n", "\n", "```\n", "papermill Example8\\ -\\ Wikidata\\ Subset.ipynb example8.out.ipynb \\\n", "-p cskg_path /Users/pedroszekely/Downloads/kypher/cskg \\\n", "-p kg cskg_connected.tsv.gz \\\n", "-p delete_database no \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for invoking the notebook\n", "\n", "- `cskg_path`: a folder containing the CSKG edges file and all the analysis products.\n", "- `kg`: the name of the edge file.\n", "- `delete_database`: whether to delete the SQL database before running the notebook: \"\" or \"no\" means don't delete it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preamble\n", "\n", "Set up paths and environment variables." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "cskg_path = \"/Users/pedroszekely/Downloads/kypher/cskg\"\n", "kg = \"cskg_connected.tsv.gz\"\n", "delete_database = \"no\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import subprocess\n", "import sys\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import altair as alt\n", "# from IPython.display import display, HTML, Image\n", "# from pandas_profiling import ProfileReport" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "os.environ['CSKG'] = cskg_path\n", "os.environ['KG'] = \"{}/{}\".format(cskg_path, kg)\n", "os.environ['NKG'] = \"{}/cskg-normalized.tsv.gz\".format(cskg_path)\n", "os.environ['STORE'] = \"{}/wikidata.sqlite3.db\".format(cskg_path)\n", "os.environ['kypher'] = \"time kgtk query --graph-cache \"
+ os.environ['STORE']\n", "# os.environ['kypher'] = \"time kgtk --debug query --graph-cache \" + os.environ['STORE']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher/cskg\n", "/Users/pedroszekely/Downloads/kypher/cskg/cskg_connected.tsv.gz\n", "time kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db\n", "/Users/pedroszekely/Downloads/kypher/cskg/wikidata.sqlite3.db\n" ] } ], "source": [ "!echo $CSKG\n", "!echo $KG\n", "!echo $kypher\n", "!echo $STORE" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kypher/cskg\n" ] } ], "source": [ "cd $cskg_path" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "if delete_database and delete_database != \"no\":\n", "    print(\"Deleted database\")\n", "    !rm $STORE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Utilities" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def bar_chart(data, x_column, y_column):\n", "    \"\"\"Construct a simple bar chart with two properties\"\"\"\n", "    bars = alt.Chart(data).mark_bar().encode(\n", "        y=alt.Y(y_column, sort='-x'),\n", "        x=x_column\n", "    )\n", "\n", "    text = bars.mark_text(\n", "        align='left',\n", "        baseline='middle',\n", "        dx=3  # Nudges text to right so it doesn't appear on top of the bar\n", "    ).encode(\n", "        text=x_column\n", "    )\n", "\n", "    return (bars + text)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import io\n", "import pandas\n", "import subprocess\n", "\n", "def shell_df(command, shell=False, **kwargs):\n", "    \"\"\"\n", "    Takes a shell command as a string and reads the result into a Pandas DataFrame.\n", "    \n", "    Additional keyword arguments are 
passed through to pandas.read_csv.\n", "    \n", "    :param command: a shell command that returns tabular data\n", "    :type command: str\n", "    :param shell: passed to subprocess.Popen\n", "    :type shell: bool\n", "    \n", "    :return: a pandas dataframe\n", "    :rtype: :class:`pandas.DataFrame`\n", "    \"\"\"\n", "    proc = subprocess.Popen(command, \n", "                            shell=shell,\n", "                            stdout=subprocess.PIPE, \n", "                            stderr=subprocess.PIPE)\n", "    output, error = proc.communicate()\n", "    \n", "    if proc.returncode == 0:\n", "        if error:\n", "            print(error.decode())\n", "        with io.StringIO(output.decode()) as buffer:\n", "            return pandas.read_csv(buffer, **kwargs)\n", "    else:\n", "        message = (\"Shell command returned non-zero exit status: {0}\\n\\n\"\n", "                   \"Command was:\\n{1}\\n\\n\"\n", "                   \"Standard error was:\\n{2}\")\n", "        raise IOError(message.format(proc.returncode, command, error.decode()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Poking around" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print some lines to see what we have" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 relation node2 node1;label node2;label relation;label relation;dimension source sentence\n", "zcat: /c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000 /c/en/0.22_inch_calibre /r/IsA /c/en/5.6_millimetres 0.22 inch calibre 5.6 millimetres is a CN [[0.22 inch calibre]] is [[5.6 millimetres]]\n", "error writing to output/c/en/0/a/wn-/r/SimilarTo-/c/en/cardinal/a/wn-0000 /c/en/0/a/wn /r/SimilarTo /c/en/cardinal/a/wn 0 cardinal similar to CN [[0]] is similar to [[cardinal]]\n", ": Broken pipe\n", "/c/en/0/n/wn/quantity-/r/Synonym-/c/en/zero/n/wn/quantity-0000 /c/en/0/n/wn/quantity /r/Synonym /c/en/zero/n/wn/quantity 0 zero synonym CN [[0]] is a synonym of [[zero]]\n", "/c/en/0/n/wp/number-/r/Synonym-/c/en/0/n/wp/number-0000 /c/en/0/n/wp/number /r/Synonym /c/en/0/n/wp/number 0 0 synonym
CN\n", "/c/en/0/n-/r/Antonym-/c/en/1-0000 /c/en/0/n /r/Antonym /c/en/1 0 1 antonym CN\n", "/c/en/0/n-/r/HasContext-/c/en/electrical_engineering-0000 /c/en/0/n /r/HasContext /c/en/electrical_engineering 0 electrical engineering has context CN\n", "/c/en/0/n-/r/RelatedTo-/c/en/low-0000 /c/en/0/n /r/RelatedTo /c/en/low 0 low related to CN\n", "/c/en/000/n-/r/RelatedTo-/c/en/emergency_service-0000 /c/en/000/n /r/RelatedTo /c/en/emergency_service 000 emergency service related to CN\n", "/c/en/000-/r/RelatedTo-/c/en/112-0000 /c/en/000 /r/RelatedTo /c/en/112 000 112 related to CN\n" ] } ], "source": [ "!zcat < \"$KG\" | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Normalize the file so that it is easier to process with Kypher" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: id node1 label node2 node1;label node2;label label;label label;dimension source sentence\n", "error writing to output: Broken pipe\n", "/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000 /c/en/0.22_inch_calibre /r/IsA /c/en/5.6_millimetres 0.22 inch calibre 5.6 millimetres is a CN [[0.22 inch calibre]] is [[5.6 millimetres]]\n", "/c/en/0/a/wn-/r/SimilarTo-/c/en/cardinal/a/wn-0000 /c/en/0/a/wn /r/SimilarTo /c/en/cardinal/a/wn 0 cardinal similar to CN [[0]] is similar to [[cardinal]]\n", "/c/en/0/n/wn/quantity-/r/Synonym-/c/en/zero/n/wn/quantity-0000 /c/en/0/n/wn/quantity /r/Synonym /c/en/zero/n/wn/quantity 0 zero synonym CN [[0]] is a synonym of [[zero]]\n", "/c/en/0/n/wp/number-/r/Synonym-/c/en/0/n/wp/number-0000 /c/en/0/n/wp/number /r/Synonym /c/en/0/n/wp/number 0 0 synonym CN\n", "/c/en/0/n-/r/Antonym-/c/en/1-0000 /c/en/0/n /r/Antonym /c/en/1 0 1 antonym CN\n", "/c/en/0/n-/r/HasContext-/c/en/electrical_engineering-0000 /c/en/0/n /r/HasContext /c/en/electrical_engineering 0 electrical engineering has context CN\n", "/c/en/0/n-/r/RelatedTo-/c/en/low-0000 
/c/en/0/n /r/RelatedTo /c/en/low 0 low related to CN\n", "/c/en/000/n-/r/RelatedTo-/c/en/emergency_service-0000 /c/en/000/n /r/RelatedTo /c/en/emergency_service 000 emergency service related to CN\n", "/c/en/000-/r/RelatedTo-/c/en/112-0000 /c/en/000 /r/RelatedTo /c/en/112 000 112 related to CN\n" ] } ], "source": [ "!zcat < \"$KG\" | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of edges and nodes" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 20.31 real 13.03 user 5.04 sys\n", "num_edges num_nodes num_relations num_values\n", "6003237 1511776 81 1031520\n" ] } ], "source": [ "!$kypher -i \"$KG\" \\\n", "--match '(n1)-[e]->(n2)' \\\n", "--return 'count(e) as num_edges, count(distinct n1) as num_nodes, count(distinct e.relation) as num_relations, count(distinct n2) as num_values' \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Some Statistics" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 35.79 real 8.40 user 6.64 sys\n", "relation nodes\n", "/r/RelatedTo 554822\n", "/r/FormOf 376992\n", "/r/DerivedFrom 262822\n", "/r/IsA 236604\n", "/r/Synonym 229295\n", "/r/HasContext 182829\n", "/r/Antonym 37990\n", "/r/PartOf 26890\n", "at:xReact 24312\n", "at:xAttr 24312\n", "at:xWant 24158\n", "at:xEffect 23255\n", "at:xNeed 22146\n", "/r/EtymologicallyRelatedTo 21667\n", "at:xIntent 21371\n", "/r/SimilarTo 15834\n", "at:oWant 14669\n", "at:oReact 14070\n", "/r/CapableOf 10907\n", "at:oEffect 10895\n", "/r/AtLocation 9958\n", "/r/MannerOf 9896\n", "/r/HasProperty 6946\n", "/r/UsedFor 5948\n", "/r/LocatedNear 5728\n", "/r/DistinctFrom 5595\n", "mw:MayHaveProperty 5037\n", "/r/HasPrerequisite 3823\n", "/r/ReceivesAction 3822\n", "/r/CausesDesire 3581\n", "/r/HasA 3358\n", "/r/dbpedia/genus 
2924\n", "/r/MadeOf 2330\n", "/r/dbpedia/genre 2146\n", "/r/HasSubevent 1989\n", "/r/Causes 1940\n", "/r/DefinedAs 1837\n", "/r/InstanceOf 1404\n", "fn:HasFrameElement 1221\n", "fn:HasLexicalUnit 1073\n", "/r/MotivatedByGoal 969\n", "mw:HasInstance 964\n", "fn:InheritsFrom 741\n", "/r/dbpedia/language 732\n", "/r/HasFirstSubevent 672\n", "/r/dbpedia/occupation 648\n", "/r/HasLastSubevent 625\n", "fn:Uses 483\n", "/r/Desires 415\n", "/r/dbpedia/field 409\n", "/r/dbpedia/capital 403\n", "/r/Entails 378\n", "/r/CreatedBy 376\n", "/r/dbpedia/knownFor 360\n", "/r/dbpedia/influencedBy 344\n", "/r/NotHasProperty 297\n", "fn:IsUsedBy 287\n", "fn:HasSemType 275\n", "fn:IsInheritedBy 258\n", "/r/NotDesires 255\n", "/r/dbpedia/product 244\n", "/r/NotCapableOf 223\n", "fn:fe:ExcludesFE 208\n", "fn:SubframeOf 130\n", "fn:PerspectiveOn 127\n", "fn:ReframingMapping 115\n", "fn:IsPrecededBy 80\n", "fn:Precedes 77\n", "/r/EtymologicallyDerivedFrom 71\n", "/r/dbpedia/leader 69\n", "fn:fe:RequiresFE 68\n", "fn:IsPerspectivizedIn 67\n", "fn:IsCausativeOf 59\n", "fn:SeeAlso 56\n", "fn:HasSubframe 50\n", "fn:st:SuperType 30\n", "fn:st:RootType 30\n", "fn:IsInchoativeOf 19\n", "fn:st:SubType 11\n", "fn:Metaphor 4\n", "/r/SymbolOf 4\n" ] } ], "source": [ "!$kypher -i \"$KG\" \\\n", "--match '(n1)-[e]->(n2)' \\\n", "--return 'distinct e.relation, count(distinct n1) as nodes' \\\n", "--order-by 'count(distinct n1) desc' \\\n", "| column -t -s $'\\t' " ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9.96 real 8.46 user 1.39 sys\n", "\n" ] } ], "source": [ "command = \"$kypher -i $KG \\\n", "--match '(n1)-[e]->(n2)' \\\n", "--return 'distinct e.relation, count(distinct n1) as nodes' \\\n", "--order-by 'count(distinct n1) desc'\"\n", "data = shell_df(command, shell=True, sep='\\t')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bar_chart(data, 'nodes', 'relation')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First find pairs of nodes `n1` and `n2` that share a common label. To avoid outputting the cross product, test `n1 < n2`. If we do `n1 <= n2` we should also get the reflexive relation, every node equal to itself. Unfortunately, this makes the file much larger and the next commands take a very long time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build the clusters" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1160.11 real 1119.48 user 25.53 sys\n" ] } ], "source": [ "!$kypher -i \"$KG\" \\\n", "--match '(n1 {label: label})-[]->(), (n2 {label: label})-[]->()' \\\n", "--where 'n1 < n2' \\\n", "--return 'distinct n1 as node_x, n2 as node_y, \"same_name\" as relation, label as common_label' \\\n", "--order-by 'label' \\\n", "-o $CSKG/same_name.tsv.gz" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1317833\n" ] } ], "source": [ "!zcat < $CSKG/same_name.tsv.gz | wc -l" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename the edges and add ids so that we can use the file in KGTK" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "!kgtk rename-columns --mode NONE -i $CSKG/same_name.tsv.gz --output-columns node1 node2 relation common_label \\\n", "/ add-id --id-style node1-label-node2 -o $CSKG/same_name_edges.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what we got" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "node1\tnode2\trelation\tcommon_label\tid\n", "fn:fe:abundant_entities\tfn:fe:abuser\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:abuser\n", "fn:fe:abundant_entities\tfn:fe:accessibility\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:accessibility\n", "fn:fe:abundant_entities\tfn:fe:accoutrement\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:accoutrement\n", "fn:fe:abundant_entities\tfn:fe:accuracy\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:accuracy\n", "fn:fe:abundant_entities\tfn:fe:accused\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:accused\n", "fn:fe:abundant_entities\tfn:fe:act\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:act\n", "fn:fe:abundant_entities\tfn:fe:action\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:action\n", "fn:fe:abundant_entities\tfn:fe:activists\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:activists\n", "fn:fe:abundant_entities\tfn:fe:activity\tsame_name\t\tfn:fe:abundant_entities-same_name-fn:fe:activity\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < $CSKG/same_name_edges.tsv.gz | head" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 node2 relation common_label id\n", "/c/en/pentobarbital /c/en/pentobarbital/n same_name pentobarbital /c/en/pentobarbital-same_name-/c/en/pentobarbital/n\n", "/c/en/piet /c/en/piet/n same_name piet /c/en/piet-same_name-/c/en/piet/n\n", "/c/en/plymouth_county /c/en/plymouth_county/n same_name plymouth county /c/en/plymouth_county-same_name-/c/en/plymouth_county/n\n", "/c/en/postracial /c/en/postracial/a same_name postracial /c/en/postracial-same_name-/c/en/postracial/a\n", "/c/en/printout /c/en/printout/n same_name printout /c/en/printout-same_name-/c/en/printout/n\n", "/c/en/pug/n/wn/animal /c/en/pug/v/wikt/en_4 same_name pug /c/en/pug/n/wn/animal-same_name-/c/en/pug/v/wikt/en_4\n", 
"/c/en/raddle /c/en/raddle/v/wn/contact same_name raddle /c/en/raddle-same_name-/c/en/raddle/v/wn/contact\n", "/c/en/reclothe /c/en/reclothe/v same_name reclothe /c/en/reclothe-same_name-/c/en/reclothe/v\n", "/c/en/repetition/v/wikt/en_2 Q18699055 same_name repetition /c/en/repetition/v/wikt/en_2-same_name-Q18699055\n" ] } ], "source": [ "!kgtk cat --every-nth-record 10000 --initial-skip-count 1000000 -i $CSKG/same_name_edges.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's form clusters of all `node1` nodes that share a common label. We use the common label as the identifier of the cluster and add the nodes as its members." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 27.34 real 25.69 user 4.16 sys\n" ] } ], "source": [ "!$kypher -i $CSKG/same_name_edges.tsv.gz \\\n", "--match '(n1)-[l {common_label: common}]->()' \\\n", "--where 'common != \"\"' \\\n", "--return 'common as node_x, \"cluster_member\" as relation, n1 as node_y' \\\n", "--order-by 'common' \\\n", "-o $CSKG/temp.cluster.node1.tsv.gz " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the same with `node2` so that they are also members of the clusters."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 5.66 real 5.29 user 0.30 sys\n" ] } ], "source": [ "!$kypher -i $CSKG/same_name_edges.tsv.gz \\\n", "--match '()-[l {common_label: common}]->(n2)' \\\n", "--where 'common != \"\"' \\\n", "--return 'common as node_x, \"cluster_member\" as relation, n2 as node_y' \\\n", "--order-by 'common' \\\n", "-o $CSKG/temp.cluster.node2.tsv.gz " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 853667 2713904 37415009\n", " 853667 2713908 38917319\n", "zcat: error writing to output: Broken pipe\n", "node_x relation node_y\n", "\"meerkats\" tv show cluster_member /c/en/meerkat/n/wn/animal\n", "'zeros - europe' cluster_member /c/en/europe/n/wn/group\n", "0 cluster_member /c/en/0/a/wn\n", "0 cluster_member /c/en/0/a/wn\n", "0 cluster_member /c/en/0/a/wn\n", "0 cluster_member /c/en/0/a/wn\n", "0 cluster_member /c/en/0/n/wn/quantity\n", "0 cluster_member /c/en/0/n/wn/quantity\n", "0 cluster_member /c/en/0/n/wp/number\n", "zcat: node_x relation node_y\n", "\"meerkats\" tv show cluster_member /c/en/television_program/n/wn/communication\n", "'zeros - europe' cluster_member /c/en/nothing/n/wn/quantity\n", "0 cluster_member /c/en/0/n\n", "0 cluster_member /c/en/0/n/wn/quantity\n", "error writing to output0 cluster_member /c/en/0/n/wp/number\n", ": 0 cluster_member /c/en/zero/n/wn/quantity\n", "Broken pipe\n", "0 cluster_member /c/en/0/n/wp/number\n", "0 cluster_member /c/en/zero/n/wn/quantity\n", "0 cluster_member /c/en/zero/n/wn/quantity\n" ] } ], "source": [ "!zcat < $CSKG/temp.cluster.node1.tsv.gz | wc\n", "!zcat < $CSKG/temp.cluster.node2.tsv.gz | wc\n", "!zcat < $CSKG/temp.cluster.node1.tsv.gz | head | column -t -s $'\\t' \n", "!zcat < $CSKG/temp.cluster.node2.tsv.gz | head | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We 
had to use `node_x` and `node_y` as the names of the columns because kypher refused to output them as `node1` and `node2`. Now we have to rename them." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "!kgtk rename-columns --mode NONE --output-columns node1 label node2 -i $CSKG/temp.cluster.node1.tsv.gz -o $CSKG/temp.cluster.node1.renamed.tsv.gz" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "!kgtk rename-columns --mode NONE --output-columns node1 label node2 -i $CSKG/temp.cluster.node2.tsv.gz -o $CSKG/temp.cluster.node2.renamed.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now concatenate the two cluster files, and add ids based on `node1/relation/node2` so that we can deduplicate later." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i $CSKG/temp.cluster.node1.renamed.tsv.gz -i $CSKG/temp.cluster.node2.renamed.tsv.gz \\\n", "/ add-id --id-style node1-label-node2 \\\n", "/ sort2 \\\n", "-o $CSKG/temp.name.clusters.1.tsv.gz" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: node1 label node2 id\n", "error writing to output\"meerkats\" tv show cluster_member /c/en/meerkat/n/wn/animal \"meerkats\" tv show-cluster_member-/c/en/meerkat/n/wn/animal\n", ": Broken pipe\n", "\"meerkats\" tv show cluster_member /c/en/television_program/n/wn/communication \"meerkats\" tv show-cluster_member-/c/en/television_program/n/wn/communication\n", "'zeros - europe' cluster_member /c/en/europe/n/wn/group 'zeros - europe'-cluster_member-/c/en/europe/n/wn/group\n", "'zeros - europe' cluster_member /c/en/nothing/n/wn/quantity 'zeros - europe'-cluster_member-/c/en/nothing/n/wn/quantity\n", "0 100 cluster_member /c/en/0_100 0 100-cluster_member-/c/en/0_100\n", "0 100 cluster_member /c/en/0_100/n 0 
100-cluster_member-/c/en/0_100/n\n", "0 4 0 0 4 0 cluster_member /c/en/0_4_0_0_4_0 0 4 0 0 4 0-cluster_member-/c/en/0_4_0_0_4_0\n", "0 4 0 0 4 0 cluster_member /c/en/0_4_0_0_4_0/n 0 4 0 0 4 0-cluster_member-/c/en/0_4_0_0_4_0/n\n", "0 60 0 cluster_member /c/en/0_60_0 0 60 0-cluster_member-/c/en/0_60_0\n" ] } ], "source": [ "!zcat < $CSKG/temp.name.clusters.1.tsv.gz | head -10 | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have lots of duplicates, so let's get rid of them using the compact command (BTW, the --presorted flag does not work even though the file was the output of `sort2`)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "!kgtk compact -i $CSKG/temp.name.clusters.1.tsv.gz -o $CSKG/temp.name.clusters.2.tsv.gz" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2 id\n", "zcat: \"meerkats\" tv show cluster_member /c/en/meerkat/n/wn/animal \"meerkats\" tv show-cluster_member-/c/en/meerkat/n/wn/animal\n", "\"meerkats\" tv show cluster_member /c/en/television_program/n/wn/communication \"meerkats\" tv show-cluster_member-/c/en/television_program/n/wn/communication\n", "'zeros - europe' cluster_member /c/en/europe/n/wn/group 'zeros - europe'-cluster_member-/c/en/europe/n/wn/group\n", "'zeros - europe' cluster_member /c/en/nothing/n/wn/quantity 'zeros - europe'-cluster_member-/c/en/nothing/n/wn/quantity\n", "0 cluster_member /c/en/0 0-cluster_member-/c/en/0\n", "0 cluster_member /c/en/0/a/wn 0-cluster_member-/c/en/0/a/wn\n", "error writing to output0 cluster_member /c/en/0/n 0-cluster_member-/c/en/0/n\n", ": Broken pipe\n", "0 cluster_member /c/en/0/n/wn/quantity 0-cluster_member-/c/en/0/n/wn/quantity\n", "0 cluster_member /c/en/0/n/wp/number 0-cluster_member-/c/en/0/n/wp/number\n" ] } ], "source": [ "!zcat < $CSKG/temp.name.clusters.2.tsv.gz | head -10 | column -t -s $'\\t' " ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "For fun, let's look at the cluster for `belt`." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 14.09 real 11.78 user 2.99 sys\n", "node1 label node2 id\n", "belt cluster_member /c/en/belt belt-cluster_member-/c/en/belt\n", "belt cluster_member /c/en/belt/n belt-cluster_member-/c/en/belt/n\n", "belt cluster_member /c/en/belt/n/opencyc/belt_clothing belt-cluster_member-/c/en/belt/n/opencyc/belt_clothing\n", "belt cluster_member /c/en/belt/n/opencyc/belt_mechanical belt-cluster_member-/c/en/belt/n/opencyc/belt_mechanical\n", "belt cluster_member /c/en/belt/n/opencyc/belt_region belt-cluster_member-/c/en/belt/n/opencyc/belt_region\n", "belt cluster_member /c/en/belt/n/wn/act belt-cluster_member-/c/en/belt/n/wn/act\n", "belt cluster_member /c/en/belt/n/wn/artifact belt-cluster_member-/c/en/belt/n/wn/artifact\n", "belt cluster_member /c/en/belt/n/wn/event belt-cluster_member-/c/en/belt/n/wn/event\n", "belt cluster_member /c/en/belt/n/wn/location belt-cluster_member-/c/en/belt/n/wn/location\n", "belt cluster_member /c/en/belt/n/wn/object belt-cluster_member-/c/en/belt/n/wn/object\n", "belt cluster_member /c/en/belt/v belt-cluster_member-/c/en/belt/v\n", "belt cluster_member /c/en/belt/v/wn/contact belt-cluster_member-/c/en/belt/v/wn/contact\n", "belt cluster_member /c/en/belt/v/wn/creation belt-cluster_member-/c/en/belt/v/wn/creation\n", "belt cluster_member Q134560 belt-cluster_member-Q134560\n", "belt cluster_member Q623755 belt-cluster_member-Q623755\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \\\n", "--match '(cluster:`belt`)-[l]->(n2)' \\\n", "--limit 20 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m Q134560\n", "\u001b[42mLabel\u001b[49m belt\n",
"\u001b[44mDescription\u001b[49m worn band or braid, usually around the waist or hips\n", "\u001b[30m\u001b[47msubclass of\u001b[49m\u001b[39m \u001b[90m(P279)\u001b[39m\u001b[90m: \u001b[39mcostume accessory \u001b[90m(Q1065579)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q623755\n", "\u001b[42mLabel\u001b[49m belt\n", "\u001b[44mDescription\u001b[49m loop of flexible material used to mechanically link rotating shafts\n", "\u001b[30m\u001b[47msubclass of\u001b[49m\u001b[39m \u001b[90m(P279)\u001b[39m\u001b[90m: \u001b[39mdevice \u001b[90m(Q1183543)\u001b[39m\n" ] } ], "source": [ "!wd u Q134560 Q623755" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Look at popular clusters" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.86 real 1.52 user 0.30 sys\n", "\n" ] } ], "source": [ "command = \"$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \\\n", "--match '(cluster)-[l]-(member)' \\\n", "--return 'distinct cluster as node, count(distinct member) as count' \\\n", "--order-by 'count(distinct member) desc' \\\n", "--limit 50\" \n", "data = shell_df(command, shell=True, sep='\\t')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bar_chart(data, 'count', 'node')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.83 real 0.66 user 0.15 sys\n", "node1 label node2 id\n", "news in brief cluster_member Q56836285 news in brief-cluster_member-Q56836285\n", "news in brief cluster_member Q58965155 news in brief-cluster_member-Q58965155\n", "news in brief cluster_member Q58965282 news in brief-cluster_member-Q58965282\n", "news in brief cluster_member Q58965656 news in brief-cluster_member-Q58965656\n", "news in brief cluster_member Q58965794 news in brief-cluster_member-Q58965794\n", "news in brief cluster_member Q58965916 news in brief-cluster_member-Q58965916\n", "news in brief cluster_member Q58966165 news in brief-cluster_member-Q58966165\n", "news in brief cluster_member Q58979818 news in brief-cluster_member-Q58979818\n", "news in brief cluster_member Q58979822 news in brief-cluster_member-Q58979822\n", "news in brief cluster_member Q58980098 news in brief-cluster_member-Q58980098\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \\\n", "--match '(cluster:`news in brief`)-[l]->(n2)' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh, we don't want this one." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m Q56836285\n", "\u001b[42mLabel\u001b[49m news in brief\n", "\u001b[44mDescription\u001b[49m scientific article published in Nature\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mscholarly article \u001b[90m(Q13442814)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q58965155\n", "\u001b[42mLabel\u001b[49m news in brief\n", "\u001b[44mDescription\u001b[49m article publié dans la revue scientifique Nature\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mscholarly article \u001b[90m(Q13442814)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q58965282\n", "\u001b[42mLabel\u001b[49m news in brief\n", "\u001b[44mDescription\u001b[49m article publié dans la revue scientifique Nature\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mscholarly article \u001b[90m(Q13442814)\u001b[39m\n" ] } ], "source": [ "!wd u Q56836285 Q58965155 Q58965282" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.84 real 0.68 user 0.14 sys\n", "node1 label node2 id\n", "flute cluster_member /c/en/flute flute-cluster_member-/c/en/flute\n", "flute cluster_member /c/en/flute/n flute-cluster_member-/c/en/flute/n\n", "flute cluster_member /c/en/flute/n/wikt/en_1 flute-cluster_member-/c/en/flute/n/wikt/en_1\n", "flute cluster_member /c/en/flute/n/wikt/en_2 flute-cluster_member-/c/en/flute/n/wikt/en_2\n", "flute cluster_member /c/en/flute/n/wn/artifact flute-cluster_member-/c/en/flute/n/wn/artifact\n", "flute cluster_member /c/en/flute/v/wikt/en_1 flute-cluster_member-/c/en/flute/v/wikt/en_1\n", "flute cluster_member /c/en/flute/v/wn/contact flute-cluster_member-/c/en/flute/v/wn/contact\n", "flute 
cluster_member Q89192698 flute-cluster_member-Q89192698\n", "flute cluster_member Q89192704 flute-cluster_member-Q89192704\n", "flute cluster_member Q89192707 flute-cluster_member-Q89192707\n", "flute cluster_member Q89192713 flute-cluster_member-Q89192713\n", "flute cluster_member Q89192718 flute-cluster_member-Q89192718\n", "flute cluster_member Q89192720 flute-cluster_member-Q89192720\n", "flute cluster_member Q89192724 flute-cluster_member-Q89192724\n", "flute cluster_member Q89192729 flute-cluster_member-Q89192729\n", "flute cluster_member Q89192732 flute-cluster_member-Q89192732\n", "flute cluster_member Q89192736 flute-cluster_member-Q89192736\n", "flute cluster_member Q89192740 flute-cluster_member-Q89192740\n", "flute cluster_member Q89192755 flute-cluster_member-Q89192755\n", "flute cluster_member Q89192758 flute-cluster_member-Q89192758\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \\\n", "--match '(cluster:`flute`)-[l]->(n2)' \\\n", "--limit 20 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, those specific flutes probably don't belong in CSKG" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m Q89192698\n", "\u001b[42mLabel\u001b[49m flute\n", "\u001b[44mDescription\u001b[49m Flute, Johann Georg Braun, Mannheim, 1816–1833\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mflute \u001b[90m(Q5462939)\u001b[39m | flute \u001b[90m(Q11405)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q89192704\n", "\u001b[42mLabel\u001b[49m flute\n", "\u001b[44mDescription\u001b[49m Flute, Cortellini, Turin, second quarter of 19th century\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mflute \u001b[90m(Q5462939)\u001b[39m | flute \u001b[90m(Q11405)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m 
Q89192707\n", "\u001b[42mLabel\u001b[49m flute\n", "\u001b[44mDescription\u001b[49m Flute, Cornelius Ward, London, c. 1842\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mflute \u001b[90m(Q5462939)\u001b[39m | flute \u001b[90m(Q11405)\u001b[39m\n" ] } ], "source": [ "!wd u Q89192698 Q89192704 Q89192707" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.87 real 0.68 user 0.16 sys\n", "node1 label node2 id\n", "break cluster_member /c/en/absconder/n/wn/person break-cluster_member-/c/en/absconder/n/wn/person\n", "break cluster_member /c/en/american_civil_war/n/wn/act break-cluster_member-/c/en/american_civil_war/n/wn/act\n", "break cluster_member /c/en/break break-cluster_member-/c/en/break\n", "break cluster_member /c/en/break/n/wikt/en_1 break-cluster_member-/c/en/break/n/wikt/en_1\n", "break cluster_member /c/en/break/n/wikt/en_2 break-cluster_member-/c/en/break/n/wikt/en_2\n", "break cluster_member /c/en/break/n/wn/geology break-cluster_member-/c/en/break/n/wn/geology\n", "break cluster_member /c/en/break/n/wn/state break-cluster_member-/c/en/break/n/wn/state\n", "break cluster_member /c/en/break/n/wn/tennis break-cluster_member-/c/en/break/n/wn/tennis\n", "break cluster_member /c/en/break/n/wn/time break-cluster_member-/c/en/break/n/wn/time\n", "break cluster_member /c/en/break/n/wp/music break-cluster_member-/c/en/break/n/wp/music\n", "break cluster_member /c/en/break/v/wikt/en_1 break-cluster_member-/c/en/break/v/wikt/en_1\n", "break cluster_member /c/en/break/v/wn/billiards break-cluster_member-/c/en/break/v/wn/billiards\n", "break cluster_member /c/en/break/v/wn/body break-cluster_member-/c/en/break/v/wn/body\n", "break cluster_member /c/en/break/v/wn/cognition break-cluster_member-/c/en/break/v/wn/cognition\n", "break cluster_member /c/en/break/v/wn/communication 
break-cluster_member-/c/en/break/v/wn/communication\n", "break cluster_member /c/en/break/v/wn/competition break-cluster_member-/c/en/break/v/wn/competition\n", "break cluster_member /c/en/break/v/wn/contact break-cluster_member-/c/en/break/v/wn/contact\n", "break cluster_member /c/en/break/v/wn/emotion break-cluster_member-/c/en/break/v/wn/emotion\n", "break cluster_member /c/en/break/v/wn/military break-cluster_member-/c/en/break/v/wn/military\n", "break cluster_member /c/en/break/v/wn/possession break-cluster_member-/c/en/break/v/wn/possession\n", "break cluster_member /c/en/break/v/wn/stative break-cluster_member-/c/en/break/v/wn/stative\n", "break cluster_member /c/en/interruption/n/wn/event break-cluster_member-/c/en/interruption/n/wn/event\n", "break cluster_member Q1681122 break-cluster_member-Q1681122\n", "break cluster_member Q2707973 break-cluster_member-Q2707973\n", "break cluster_member Q55398038 break-cluster_member-Q55398038\n", "break cluster_member Q903577 break-cluster_member-Q903577\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz \\\n", "--match '(cluster:`break`)-[l]->(n2)' \\\n", "--limit 100 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m Q1681122\n", "\u001b[42mLabel\u001b[49m break\n", "\u001b[44mDescription\u001b[49m period of time during a shift in which an employee is allowed to take time off\n", "\u001b[30m\u001b[47msubclass of\u001b[49m\u001b[39m \u001b[90m(P279)\u001b[39m\u001b[90m: \u001b[39mtime interval \u001b[90m(Q186081)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q2707973\n", "\u001b[42mLabel\u001b[49m break\n", "\u001b[44mDescription\u001b[49m tennis\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39msports terminology \u001b[90m(Q28829877)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m Q55398038\n", 
"\u001b[42mLabel\u001b[49m break\n", "\u001b[44mDescription\u001b[49m in cue sports\n", "\n" ] } ], "source": [ "!wd u Q1681122 Q2707973 Q55398038" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Relations among clusters" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.71 real 1.32 user 0.32 sys\n", "relation value\n", "/r/AtLocation /c/en/disneyland\n", "/r/AtLocation /c/en/freezer\n", "/r/AtLocation /c/en/movie\n", "/r/AtLocation /c/en/party\n", "/r/CapableOf /c/en/delight_child\n", "/r/CapableOf /c/en/melt\n", "/r/CapableOf /c/en/taste_sweet\n", "/r/CapableOf /c/en/earth_science/n/wn/cognition\n", "/r/CapableOf /c/en/melt/v/wn/change\n", "/r/CapableOf /c/en/scoop/v/wn/contact\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \\\n", "--match 'clusters: (cluster:`ice cream`)-[l]->(n2), cskg: (n2)-[rid]->(object)' \\\n", "--return 'distinct rid.relation as relation, object as value' \\\n", "--order-by 'rid.relation' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 151.07 real 109.45 user 10.77 sys\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \\\n", "--match 'clusters: (cluster)-[l]->(n2), cskg: (n2)-[rid]->(object), clusters: (word)-[]->(object)' \\\n", "--return 'distinct cluster as subject, rid.relation as relation, word as value, rid.source as source' \\\n", "--order-by 'cluster, rid.relation, rid.source, word' \\\n", "-o $CSKG/relations.tsv.gz" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 10931565 49980020 424067168\n" ] } ], "source": [ "!zcat < $CSKG/relations.tsv.gz | wc" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "zcat: error writing to output: Broken pipe\n", "subject relation value source\n", "\"meerkats\" tv show /r/CapableOf glance VG\n", "\"meerkats\" tv show /r/IsA \"meerkats\" tv show CN|WN\n", "\"meerkats\" tv show /r/IsA broadcast CN|WN\n", "\"meerkats\" tv show /r/IsA meerkat CN|WN\n", "\"meerkats\" tv show /r/IsA network CN|WN\n", "\"meerkats\" tv show /r/IsA slightly CN|WN\n", "\"meerkats\" tv show /r/LocatedNear television VG\n", "\"meerkats\" tv show /r/LocatedNear tv VG\n", "\"meerkats\" tv show /r/PartOf \"meerkats\" tv show WN\n" ] } ], "source": [ "!zcat < $CSKG/relations.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 166.52 real 147.83 user 7.20 sys\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \\\n", "--match 'clusters: (cluster)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \\\n", "--return 'distinct rid.source as source, cluster as subject, rel_label as `relation id`, word as value, rel_label.label as relation' \\\n", "--order-by 'cluster, rid.relation, rid.source, word' \\\n", "-o $CSKG/relations-detailed.tsv.gz" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 12626473 102940980 903658537\n" ] } ], "source": [ "!zcat < $CSKG/relations-detailed.tsv.gz | wc" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: source subject relation id value relation\n", "VG \"meerkats\" tv show /r/CapableOf glance capable of\n", "CN|WN \"meerkats\" tv show /r/IsA \"meerkats\" tv show is a\n", "error writing to outputCN|WN \"meerkats\" tv show /r/IsA broadcast is a\n", ": CN|WN \"meerkats\" tv show /r/IsA meerkat is a\n", "Broken pipe\n", "CN|WN \"meerkats\" tv 
show /r/IsA network is a\n", "CN|WN \"meerkats\" tv show /r/IsA slightly is a\n", "VG \"meerkats\" tv show /r/LocatedNear television on front of|playing on|written on\n", "VG \"meerkats\" tv show /r/LocatedNear tv on front of|playing on|written on\n", "WN \"meerkats\" tv show /r/PartOf \"meerkats\" tv show is a part of\n" ] } ], "source": [ "!zcat < $CSKG/relations-detailed.tsv.gz | head | column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.13 real 0.74 user 0.20 sys\n", "source subject relation id value relation\n", "CN teddy bear /r/AtLocation bed at location\n", "CN teddy bear /r/AtLocation home at location\n", "CN teddy bear /r/AtLocation shelf at location\n", "VG teddy bear /r/CapableOf alive capable of\n", "VG teddy bear /r/CapableOf appear capable of\n", "VG teddy bear /r/CapableOf asleep capable of\n", "VG teddy bear /r/CapableOf bat capable of\n", "VG teddy bear /r/CapableOf be capable of\n", "VG teddy bear /r/CapableOf beamish capable of\n", "VG teddy bear /r/CapableOf bear capable of\n", "VG teddy bear /r/CapableOf beckon capable of\n", "VG teddy bear /r/CapableOf body count capable of\n", "VG teddy bear /r/CapableOf botch capable of\n", "VG teddy bear /r/CapableOf box|game capable of\n", "VG teddy bear /r/CapableOf bury capable of\n", "VG teddy bear /r/CapableOf carve capable of\n", "VG teddy bear /r/CapableOf catch capable of\n", "VG teddy bear /r/CapableOf catch|get capable of\n", "VG teddy bear /r/CapableOf cling capable of\n", "VG teddy bear /r/CapableOf clop capable of\n", "VG teddy bear /r/CapableOf confront capable of\n", "VG teddy bear /r/CapableOf conserve capable of\n", "VG teddy bear /r/CapableOf dance capable of\n", "VG teddy bear /r/CapableOf dash capable of\n", "VG teddy bear /r/CapableOf digitalize capable of\n", "VG teddy bear /r/CapableOf discontinue capable of\n", "VG teddy bear /r/CapableOf drive capable of\n", "VG teddy bear 
/r/CapableOf drives capable of\n", "VG teddy bear /r/CapableOf driving capable of\n", "VG teddy bear /r/CapableOf embrace capable of\n", "VG teddy bear /r/CapableOf endow capable of\n", "VG teddy bear /r/CapableOf exhibit capable of\n", "VG teddy bear /r/CapableOf explanatory capable of\n", "VG teddy bear /r/CapableOf expose capable of\n", "VG teddy bear /r/CapableOf fall capable of\n", "VG teddy bear /r/CapableOf fall from grace capable of\n", "VG teddy bear /r/CapableOf fall leaves capable of\n", "VG teddy bear /r/CapableOf fall of man capable of\n", "VG teddy bear /r/CapableOf fallen capable of\n", "VG teddy bear /r/CapableOf falls capable of\n", "VG teddy bear /r/CapableOf fly capable of\n", "VG teddy bear /r/CapableOf game capable of\n", "VG teddy bear /r/CapableOf games capable of\n", "VG teddy bear /r/CapableOf get capable of\n", "VG teddy bear /r/CapableOf half closed smile capable of\n", "VG teddy bear /r/CapableOf hang capable of\n", "VG teddy bear /r/CapableOf hard drive capable of\n", "VG teddy bear /r/CapableOf holiday capable of\n", "VG teddy bear /r/CapableOf holidays capable of\n", "VG teddy bear /r/CapableOf holiday|vacation capable of\n", "VG teddy bear /r/CapableOf insurance capable of\n", "VG teddy bear /r/CapableOf inverse capable of\n", "VG teddy bear /r/CapableOf keep capable of\n", "VG teddy bear /r/CapableOf keep in capable of\n", "VG teddy bear /r/CapableOf keep|preserve capable of\n", "VG teddy bear /r/CapableOf kiss capable of\n", "VG teddy bear /r/CapableOf kisses capable of\n", "VG teddy bear /r/CapableOf laugh capable of\n", "VG teddy bear /r/CapableOf lean capable of\n", "VG teddy bear /r/CapableOf liberate capable of\n", "VG teddy bear /r/CapableOf lie capable of\n", "VG teddy bear /r/CapableOf listen capable of\n", "VG teddy bear /r/CapableOf live capable of\n", "VG teddy bear /r/CapableOf look capable of\n", "VG teddy bear /r/CapableOf lounge capable of\n", "VG teddy bear /r/CapableOf lunge capable of\n", "VG teddy bear 
/r/CapableOf match capable of\n", "VG teddy bear /r/CapableOf mettlesome capable of\n", "VG teddy bear /r/CapableOf monologist capable of\n", "VG teddy bear /r/CapableOf mouth|smile capable of\n", "VG teddy bear /r/CapableOf nap capable of\n", "VG teddy bear /r/CapableOf operate capable of\n", "VG teddy bear /r/CapableOf optical drive capable of\n", "VG teddy bear /r/CapableOf oversleep capable of\n", "VG teddy bear /r/CapableOf photograph capable of\n", "VG teddy bear /r/CapableOf play capable of\n", "VG teddy bear /r/CapableOf pocket watch capable of\n", "VG teddy bear /r/CapableOf precipitate capable of\n", "VG teddy bear /r/CapableOf premium capable of\n", "VG teddy bear /r/CapableOf prepare capable of\n", "VG teddy bear /r/CapableOf put capable of\n", "VG teddy bear /r/CapableOf read capable of\n", "VG teddy bear /r/CapableOf relax capable of\n", "VG teddy bear /r/CapableOf release capable of\n", "VG teddy bear /r/CapableOf rest capable of\n", "VG teddy bear /r/CapableOf ride capable of\n", "VG teddy bear /r/CapableOf rides capable of\n", "VG teddy bear /r/CapableOf run capable of\n", "VG teddy bear /r/CapableOf running capable of\n", "VG teddy bear /r/CapableOf runs capable of\n", "VG teddy bear /r/CapableOf secrete capable of\n", "VG teddy bear /r/CapableOf see capable of\n", "VG teddy bear /r/CapableOf show capable of\n", "VG teddy bear /r/CapableOf sink capable of\n", "VG teddy bear /r/CapableOf sit capable of\n", "VG teddy bear /r/CapableOf sleep capable of\n", "VG teddy bear /r/CapableOf sleeping capable of\n", "VG teddy bear /r/CapableOf slumber capable of\n", "VG teddy bear /r/CapableOf slump capable of\n", "VG teddy bear /r/CapableOf smile capable of\n", "VG teddy bear /r/CapableOf smiler capable of\n", "VG teddy bear /r/CapableOf smiling capable of\n", "VG teddy bear /r/CapableOf spill capable of\n", "VG teddy bear /r/CapableOf spill beans capable of\n", "VG teddy bear /r/CapableOf sprint capable of\n", "VG teddy bear /r/CapableOf stack capable 
of\n", "VG teddy bear /r/CapableOf stacks capable of\n", "VG teddy bear /r/CapableOf stand capable of\n", "VG teddy bear /r/CapableOf star capable of\n", "VG teddy bear /r/CapableOf testify capable of\n", "VG teddy bear /r/CapableOf thrust capable of\n", "VG teddy bear /r/CapableOf tilt capable of\n", "VG teddy bear /r/CapableOf touristed capable of\n", "VG teddy bear /r/CapableOf travel capable of\n", "VG teddy bear /r/CapableOf uncommonness capable of\n", "VG teddy bear /r/CapableOf waking capable of\n", "VG teddy bear /r/CapableOf watch capable of\n", "VG teddy bear /r/CapableOf wear capable of\n", "VG teddy bear /r/CapableOf woodcarving capable of\n", "WD teddy bear /r/HasContext bear depicts\n", "CN teddy bear /r/IsA stuffed animal is a\n", "CN|WN teddy bear /r/IsA plush toys is a\n", "CN|WN teddy bear /r/IsA rack of toys|stand with bears|toy rack is a\n", "CN|WN teddy bear /r/IsA stuffed toy is a\n", "CN|WN teddy bear /r/IsA stuffed toys is a\n", "CN|WN teddy bear /r/IsA toy is a\n", "CN|WN teddy bear /r/IsA toys is a\n", "WD teddy bear /r/IsA stuffed toy subclass of\n", "VG teddy bear /r/LocatedNear abdomen has|white\n", "VG teddy bear /r/LocatedNear adult|person close to\n", "VG teddy bear /r/LocatedNear advertisement|banner|sign above|has|holding|on|sitting on\n", "VG teddy bear /r/LocatedNear advertisement|sign above|has|holding|on|sitting on\n", "VG teddy bear /r/LocatedNear ad|picture in\n", "VG teddy bear /r/LocatedNear air in\n", "VG teddy bear /r/LocatedNear air force in\n", "VG teddy bear /r/LocatedNear air vent in\n", "VG teddy bear /r/LocatedNear air|skies|sky in\n", "VG teddy bear /r/LocatedNear air|sky in\n", "VG teddy bear /r/LocatedNear alcohol visible through\n", "VG teddy bear /r/LocatedNear alcoholic visible through\n", "VG teddy bear /r/LocatedNear american flag holding|wearing|with\n", "VG teddy bear /r/LocatedNear animal holding\n", "VG teddy bear /r/LocatedNear animals holding\n", "VG teddy bear /r/LocatedNear animal|bear holding\n", 
"VG teddy bear /r/LocatedNear animal|bear behind|in lap of|next to|sitting in between|with a bag on its\n", "VG teddy bear /r/LocatedNear animal|bear|teddy bear holding\n", "VG teddy bear /r/LocatedNear animal|bear|teddy bear behind|in lap of|next to|sitting in between|with a bag on its\n", "VG teddy bear /r/LocatedNear animal|bird holding\n", "VG teddy bear /r/LocatedNear animal|bird hugging\n", "VG teddy bear /r/LocatedNear animal|cat holding\n", "VG teddy bear /r/LocatedNear animal|cat next to\n", "VG teddy bear /r/LocatedNear animal|cow holding\n", "VG teddy bear /r/LocatedNear animal|deer holding\n", "VG teddy bear /r/LocatedNear animal|dog holding\n", "VG teddy bear /r/LocatedNear animal|dog behind|beside|held by|next to\n", "VG teddy bear /r/LocatedNear animal|horse holding\n", "VG teddy bear /r/LocatedNear animal|horse on\n", "VG teddy bear /r/LocatedNear animal|sheep holding\n", "VG teddy bear /r/LocatedNear animal|sheep on|riding|tied to\n", "VG teddy bear /r/LocatedNear animal|zebra holding\n", "VG teddy bear /r/LocatedNear antenna positioned around\n", "VG teddy bear /r/LocatedNear antennae positioned around\n", "VG teddy bear /r/LocatedNear antennas positioned around\n", "VG teddy bear /r/LocatedNear appliance|washing machine in\n", "VG teddy bear /r/LocatedNear apron wearing\n", "VG teddy bear /r/LocatedNear arch laying on\n", "VG teddy bear /r/LocatedNear arciform laying on\n", "VG teddy bear /r/LocatedNear area|dirt|ground laying on|lying on|on|sitting on\n", "VG teddy bear /r/LocatedNear area|grass above|in|kept near\n", "VG teddy bear /r/LocatedNear arm has an|has|in|on|with\n", "VG teddy bear /r/LocatedNear armchair|chair in a|in|kept in|located in|not in|on a|on|sitting in a|sitting in|sitting on|sitting\n", "VG teddy bear /r/LocatedNear arms has an|has|in|on|with\n", "VG teddy bear /r/LocatedNear arm|umpire's arm has an|has|in|on|with\n", "VG teddy bear /r/LocatedNear arrow has a|has|in|wearing a|wearing|with\n", "VG teddy bear /r/LocatedNear 
arrows has a|has|in|wearing a|wearing|with\n", "VG teddy bear /r/LocatedNear arrow|sign has a|has|in|wearing a|wearing|with\n", "VG teddy bear /r/LocatedNear arrow|sign above|has|holding|on|sitting on\n", "VG teddy bear /r/LocatedNear artwork|frame|picture in\n", "VG teddy bear /r/LocatedNear artwork|picture in\n", "VG teddy bear /r/LocatedNear asphalt|pavement|road on\n", "VG teddy bear /r/LocatedNear athletic shoe has|sitting on|sitting over|wearing\n", "VG teddy bear /r/LocatedNear auricular bear|has a|has an|has\n", "VG teddy bear /r/LocatedNear automobile atop|balanced on|in a|in|on front of|on top of|on|placed on|sitting on\n", "VG teddy bear /r/LocatedNear avifauna hugging\n", "VG teddy bear /r/LocatedNear awning holding\n", "VG teddy bear /r/LocatedNear awnings holding\n", "VG teddy bear /r/LocatedNear axil beside|laying on|lying on|on top of|on|sitting on top of\n", "VG teddy bear /r/LocatedNear babe next to\n", "VG teddy bear /r/LocatedNear babies next to\n", "VG teddy bear /r/LocatedNear baby next to\n", "VG teddy bear /r/LocatedNear baby's arm has an|has|in|on|with\n", "VG teddy bear /r/LocatedNear baby's arm next to\n", "VG teddy bear /r/LocatedNear baby's breath next to\n", "VG teddy bear /r/LocatedNear baby|boy|child|kid in|next to\n", "VG teddy bear /r/LocatedNear baby|boy|child|kid in\n", "VG teddy bear /r/LocatedNear baby|child next to\n", "VG teddy bear /r/LocatedNear baby|child in|next to\n", "VG teddy bear /r/LocatedNear baby|elephant near\n", "VG teddy bear /r/LocatedNear baby|elephant next to\n", "VG teddy bear /r/LocatedNear baby|zebra next to\n", "VG teddy bear /r/LocatedNear back seat on\n", "VG teddy bear /r/LocatedNear background|wall against|attached to|by|hanging from|hanging on|laying next to a|leaning against|near a|next to a|next to|on side of|painted on|sitting next to|sitting on|stuck to\n", "VG teddy bear /r/LocatedNear backpack|bag carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bag carries|in right 
of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bag of carrots carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear baggage on top of|sits on|wearing\n", "VG teddy bear /r/LocatedNear baggages on top of|sits on|wearing\n", "VG teddy bear /r/LocatedNear bags carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bags|luggage carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bags|luggage on top of|sits on|wearing\n", "VG teddy bear /r/LocatedNear bag|luggage carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bag|luggage on top of|sits on|wearing\n", "VG teddy bear /r/LocatedNear bag|luggage|suitcase carries|in right of|in|lays on|lying on|on\n", "VG teddy bear /r/LocatedNear bag|luggage|suitcase on top of|sits on|wearing\n", "VG teddy bear /r/LocatedNear baking sheet covered by|on top of|on|standing on\n", "VG teddy bear /r/LocatedNear balconies on\n", "VG teddy bear /r/LocatedNear balcony on\n", "VG teddy bear /r/LocatedNear balcony|porch on\n", "VG teddy bear /r/LocatedNear bandage has\n", "VG teddy bear /r/LocatedNear bandaged has\n", "VG teddy bear /r/LocatedNear bandana wearing a\n", "VG teddy bear /r/LocatedNear banister hanging from|hanging on|leaning on|on|set on|sitting on\n", "VG teddy bear /r/LocatedNear banner|sign above|has|holding|on|sitting on\n", "VG teddy bear /r/LocatedNear barbed wire on\n", "VG teddy bear /r/LocatedNear barn in\n", "VG teddy bear /r/LocatedNear barn|building in\n", "VG teddy bear /r/LocatedNear barn|building on side of\n", "VG teddy bear /r/LocatedNear barrel on|sitting on\n", "VG teddy bear /r/LocatedNear barrelled on|sitting on\n", "VG teddy bear /r/LocatedNear barrels on|sitting on\n", "VG teddy bear /r/LocatedNear base on a\n", "VG teddy bear /r/LocatedNear base|pedestal on a\n", "VG teddy bear /r/LocatedNear base|plate carrying a|carrying\n", "VG teddy bear /r/LocatedNear base|stand on a\n", "VG teddy bear 
/r/LocatedNear basket are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear basketball hoop are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear baskets are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear basket|bowl in\n", "VG teddy bear /r/LocatedNear basket|bowl are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear basket|bowl|dish in\n", "VG teddy bear /r/LocatedNear basket|bowl|dish are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear basket|crate are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear basket|strainer are in|holding|in a|in|lying in|under\n", "VG teddy bear /r/LocatedNear bassinet in\n", "VG teddy bear /r/LocatedNear bath on\n", "VG teddy bear /r/LocatedNear bathing suit|suit|swim suit with\n", "VG teddy bear /r/LocatedNear bathroom|wall|walls against|attached to|by|hanging from|hanging on|laying next to a|leaning against|near a|next to a|next to|on side of|painted on|sitting next to|sitting on|stuck to\n", "VG teddy bear /r/LocatedNear batter's box in a|in|on top of|on|sitting on\n", "VG teddy bear /r/LocatedNear batter|man behind|laying on\n", "VG teddy bear /r/LocatedNear batter|man|player behind|laying on\n" ] } ], "source": [ "!$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \\\n", "--match 'clusters: (cluster:`teddy bear`)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \\\n", "--return 'distinct rid.source as source, cluster as subject, rel_label as `relation id`, word as value, rel_label.label as relation' \\\n", "--order-by 'cluster, rid.relation, rid.source, word' \\\n", "--limit 250 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.93 real 0.74 user 0.16 sys\n", 
"source subject relation count\n", "VG teddy bear /r/LocatedNear 1990\n", "VG teddy bear mw:MayHaveProperty 240\n", "VG teddy bear /r/CapableOf 117\n", "CN teddy bear /r/Synonym 7\n", "CN|WN teddy bear /r/IsA 6\n", "CN teddy bear /r/AtLocation 3\n", "CN teddy bear /r/IsA 2\n", "CN teddy bear /r/RelatedTo 2\n", "WD teddy bear /r/HasContext 1\n", "WD teddy bear /r/IsA 1\n", "CN teddy bear /r/UsedFor 1\n" ] } ], "source": [ "command = \"$kypher -i $CSKG/temp.name.clusters.2.tsv.gz -i $KG \\\n", "--match 'clusters: (cluster:`teddy bear`)-[l]->(n2), cskg: (n2)-[rid {relation: rel_label}]->(object), clusters: (word)-[]->(object)' \\\n", "--return 'distinct rid.source as source, cluster as subject, rel_label as relation, count(rel_label) as count' \\\n", "--order-by 'cluster, count(rel_label) desc, rid.relation, rid.source' \\\n", "--limit 250\" \n", "data = shell_df(command, shell=True, sep='\\t')\n", "bar_chart(data, 'nodes', 'relation')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: kgtk sort2 [-h] [-i INPUT] [-o OUTPUT_FILE]\n", " [-c [COLUMNS [COLUMNS ...]]] [--locale LOCALE]\n", " [-r [True|False]] [--pure-python [True|False]] [-X EXTRA]\n", " [-v]\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -i INPUT, --input-file INPUT\n", " Input file to sort. (May be omitted or '-' for stdin.)\n", " -o OUTPUT_FILE, --out OUTPUT_FILE, --output-file OUTPUT_FILE\n", " Output file to write to. 
(May be omitted or '-' for\n", " stdout.)\n", " -c [COLUMNS [COLUMNS ...]], --column [COLUMNS [COLUMNS ...]], --columns [COLUMNS [COLUMNS ...]]\n", " comma-separated list of column names to sort on.\n", " (defaults to id for node files, (node1, label, node2)\n", " for edge files without ID, (id, node1, label, node2)\n", " for edge files with ID)\n", " --locale LOCALE LC_ALL locale controls the sorting order. (default=C)\n", " -r [True|False], --reverse [True|False]\n", " When True, generate output in reverse sort order.\n", " (default=False)\n", " --pure-python [True|False]\n", " When True, sort in-memory with Python code.\n", " (default=False)\n", " -X EXTRA, --extra EXTRA\n", " extra options to supply to the sort program.\n", " (default=None)\n", "\n", " -v, --verbose Print additional progress messages (default=False).\n" ] } ], "source": [ "!kgtk sort2 --help" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: kgtk [options] command [ / command]*\n", "\n", "kgtk --- Knowledge Graph Toolkit\n", "\n", "positional arguments:\n", " command\n", " add-id Copy a KGTK file, adding ID values.\n", " calc Perform calculations on KGTK file columns.\n", " cat Concatenate KGTK files.\n", " clean-data Validate a KGTK file and output a clean copy: no\n", " comments, whitespace lines, invalid lines, etc.\n", " compact Copy a KGTK file compacting | lists.\n", " connected-components\n", " Find connected components in a Graph.\n", " expand Copy a KGTK file expanding | lists.\n", " explode (denormalize_node2)\n", " Copy a KGTK file, exploding one column (usualy node2)\n", " into seperate columns for each subfield.\n", " export-gt Export a KGTK file to Graph-tool format.\n", " export-neo4j Exports data to Neo4J Cypher Query Language\n", " statements.\n", " export-wikidata Export wikidata from a set of KGTK files.\n", " filter Filter rows by subject, predicate, object values.\n", " 
generate-mediawiki-jsons\n", " Generates mediawiki json responses from kgtk file\n", " generate-wikidata-triples\n", " Generates wikidata triples from kgtk file\n", " graph-statistics Import a CSV file in Graph-tool.\n", " ifempty Filter a KGTK file for empty fields.\n", " ifexists Filter a KGTK file by matching records in a second\n", " KGTK file.\n", " ifnotempty Filter a KGTK file for nonempty fields.\n", " ifnotexists Filter a KGTK file by not matching records in a second\n", " KGTK file.\n", " implode Copy a KGTK file, building one column (usualy node2)\n", " from seperate columns for each subfield.\n", " import-atomic Import ATOMIC into KGTK.\n", " import-concept-pairs\n", " Import concept pairs into KGTK.\n", " import-conceptnet Import ConceptNet into KGTK.\n", " import-framenet Import FrameNet into KGTK.\n", " import-ntriples Import an ntriples file.\n", " import-visualgenome\n", " Import Visual Genome into KGTK.\n", " import-wikidata Import an wikidata file into KGTK file\n", " import-wordnet Import WordNet into KGTK.\n", " join Join two KGTK files\n", " lift Lift labels from a KGTK file.\n", " lower Normalize a KGTK edge file by reversing the \"lift\"\n", " pattern.\n", " md Convert a KGTK file to a GitHub Markdown Table.\n", " merge-identical-nodes\n", " Merge identical nodes and deduplicate.\n", " normalize-nodes Normalize a KGTK node file into a KGTK edge file.\n", " paths Compute paths between nodes in a KGTK graph.\n", " query Query one or more KGTK files with Kypher\n", " reachable-nodes Find reachable nodes in a graph.\n", " remove-columns Remove columns from a KGTK file.\n", " rename-columns Rename KGTK file columns.\n", " reorder-columns Reorder KGTK file columns.\n", " sort Sort file based on one or more columns\n", " sort2 Sort file based on one or more columns\n", " text-embedding Produce embedding vectors on given file's nodes.\n", " unique Count unique values in a column.\n", " unreify-rdf-statements\n", " Unreify RDF statements in a KGTK 
file.\n", " unreify-values Unreify values in a KGTK file.\n", " validate-properties\n", " Validate property patterns in a KGTK file.\n", " validate Validate one or more KGTK files\n", " zconcat Concatenate any mixture of plain or gzip/bzip2/xz-\n", " compressed files\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -V, --version show KGTK version number and exit.\n", "\n", "shared optional arguments:\n", " --debug enable debug mode\n", " --expert enable expert mode\n", " --pipedebug enable pipe debug mode\n", " --progress enable progress monitoring\n", " --progress-tty _PROGRESS_TTY\n", " progress monitoring output tty\n", " --timing enable timing measurements\n" ] } ], "source": [ "!kgtk --help" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kgtk", "language": "python", "name": "kgtk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }