{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# KGTK Tutorial: Introduction\n", "\n", "We begin the tutorial with a quick overview of some of the commands in KGTK. Then we turn our attention to working with Wikidata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tutorial Setup\n", "\n", "Import utility functions and define environment variables for the folders and files that we will use" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ALIAS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz\"\n", "ALL: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/all.tsv.gz\"\n", "CLAIMS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz\"\n", "DESCRIPTION: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz\"\n", "EXAMPLES_DIR: \"/Users/pedroszekely/Documents/GitHub/kgtk/examples\"\n", "GE: \"/Users/pedroszekely/Downloads/kgtk-tutorial/temp/graph-embedding\"\n", "ISA: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.isa.tsv.gz\"\n", "ITEM: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.wikibase-item.tsv.gz\"\n", "KGTK_PATH: \"/Users/pedroszekely/Documents/GitHub/kgtk\"\n", "LABEL: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz\"\n", "OUT: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output\"\n", "P279: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.P279.tsv.gz\"\n", "P279STAR: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.P279star.tsv.gz\"\n", "PROPERTY_DATATYPES: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/metadata.property.datatypes.tsv.gz\"\n", "Q154ALIAS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/aliases.en.tsv.gz\"\n", "Q154ALL: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/all.tsv.gz\"\n", "Q154CLAIMS: 
\"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/claims.tsv.gz\"\n", "Q154DESCRIPTION: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/descriptions.en.tsv.gz\"\n", "Q154ISA: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/derived.isa.tsv.gz\"\n", "Q154ITEM: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/claims.wikibase-item.tsv.gz\"\n", "Q154LABEL: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/labels.en.tsv.gz\"\n", "Q154P279: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/derived.P279.tsv.gz\"\n", "Q154P279STAR: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/derived.P279star.tsv.gz\"\n", "Q154PROPERTY_DATATYPES: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/metadata.property.datatypes.tsv.gz\"\n", "Q154QUALIFIERS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/qualifiers.tsv.gz\"\n", "Q154QUALIFIERS_TIME: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/qualifiers.time.tsv.gz\"\n", "Q154SITELINKS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/output/parts/sitelinks.tsv.gz\"\n", "QUALIFIERS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/qualifiers.tsv.gz\"\n", "QUALIFIERS_TIME: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/qualifiers.time.tsv.gz\"\n", "SITELINKS: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/sitelinks.tsv.gz\"\n", "STORE: \"/Users/pedroszekely/Downloads/kgtk-tutorial/wikidata.sqlite3.miniwikidata.db\"\n", "TE: \"/Users/pedroszekely/Downloads/kgtk-tutorial/temp/text-embedding\"\n", "TEMP: \"/Users/pedroszekely/Downloads/kgtk-tutorial/temp\"\n", "USECASE_DIR: \"/Users/pedroszekely/Documents/GitHub/kgtk/use-cases\"\n", "WIKIDATA: \"/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/\"\n", "kgtk: \"kgtk --debug\"\n", "kypher: \"kgtk query --graph-cache /Users/pedroszekely/Downloads/kgtk-tutorial/wikidata.sqlite3.miniwikidata.db\"\n" ] } ], "source": [ "import sys \n", "sys.path.insert(0, 
'tutorial')\n", "from tutorial_setup import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/pedroszekely/Downloads/kgtk-tutorial\n" ] } ], "source": [ "!mkdir -p {output_path}\n", "%cd {output_path}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!mkdir -p {output_folder}\n", "!mkdir -p {temp_folder}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "!mkdir -p \"$GE\"\n", "!mkdir -p \"$TE\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick Tour Of KGTK Commands\n", "\n", "Our sample input file:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 label node2\n", "0 terminator2_jd label \"Terminator 2\"@en\n", "1 terminator2_jd instance_of film\n", "2 terminator2_jd genre science_fiction\n", "3 terminator2_jd genre action\n", "4 t4 terminator2_jd cast a_schwarzenegger\n", "5 t4 role terminator\n", "6 t6 terminator2_jd cast l_hamilton\n", "7 t6 role s_connor\n", "8 t8 terminator2_jd award academy-best-sound-editing\n", "9 t8 point_in_time ^1992-03-30T00:00:00Z/11\n", "10 t8 winner g_rydstrom\n", "11 t8 winner g_borders\n", "12 l_hamilton label \"Linda Hamilton\"@en\n", "13 a_schwarzenegger label \"Arnold Schwarzenegger\"@en\n", "14 film subclass_of visual_artwork\n", "15 terminator2_jd publication_date ^1984-10-26T00:00:00Z/11\n", "16 t15 location united_states\n", "17 terminator2_jd publication_date ^1985-02-08T00:00:00Z/11\n", "18 t17 location sweden\n", "19 terminator2_jd duration 108minute\n", "20 instance_of label \"instance of\"@en" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk cat -i \"$KGTK_PATH\"/tutorial/datasets/movies.tsv\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many edges are missing ids, let's add ids for them. We are adding wikidata-style ids, but there are many other styles:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 \\\n", "0 terminator2_jd-label-01de63 terminator2_jd \n", "1 terminator2_jd-instance_of-d0607f terminator2_jd \n", "2 terminator2_jd-genre-2e6128 terminator2_jd \n", "3 terminator2_jd-genre-bd938c terminator2_jd \n", "4 t4 terminator2_jd \n", "5 t4-role-aa802f t4 \n", "6 t6 terminator2_jd \n", "7 t6-role-a29a51 t6 \n", "8 t8 terminator2_jd \n", "9 t8-point_in_time-370fac t8 \n", "10 t8-winner-dc3cda t8 \n", "11 t8-winner-211455 t8 \n", "12 l_hamilton-label-2b3667 l_hamilton \n", "13 a_schwarzenegger-label-2a4c28 a_schwarzenegger \n", "14 film-subclass_of-f126ab film \n", "15 terminator2_jd-publication_date-e29331 terminator2_jd \n", "16 t15-location-303f2a t15 \n", "17 terminator2_jd-publication_date-6aeb53 terminator2_jd \n", "18 t17-location-295099 t17 \n", "19 terminator2_jd-duration-79d04d terminator2_jd \n", "20 instance_of-label-0e46af instance_of \n", "\n", " label node2 \n", "0 label \"Terminator 2\"@en \n", "1 instance_of film \n", "2 genre science_fiction \n", "3 genre action \n", "4 cast a_schwarzenegger \n", "5 role terminator \n", "6 cast l_hamilton \n", "7 role s_connor \n", "8 award academy-best-sound-editing \n", "9 point_in_time ^1992-03-30T00:00:00Z/11 \n", "10 winner g_rydstrom \n", "11 winner g_borders \n", "12 label \"Linda Hamilton\"@en \n", "13 label \"Arnold Schwarzenegger\"@en \n", "14 subclass_of visual_artwork \n", "15 publication_date ^1984-10-26T00:00:00Z/11 \n", "16 location united_states \n", "17 publication_date ^1985-02-08T00:00:00Z/11 \n", "18 location sweden \n", "19 duration 108minute \n", "20 label \"instance of\"@en " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk add-id --id-style wikidata -i \"$KGTK_PATH\"/tutorial/datasets/movies.tsv\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put the new version of the movies with ids in a file:" ] }, { "cell_type": "code", "execution_count": 7, 
"metadata": {}, "outputs": [], "source": [ "!$kgtk add-id --id-style wikidata -i \"$KGTK_PATH\"/tutorial/datasets/movies.tsv \\\n", "-o \"$TEMP\"/movies.ids.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort the file by id (there are many other ways to sort):" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 \\\n", "0 a_schwarzenegger-label-2a4c28 a_schwarzenegger \n", "1 film-subclass_of-f126ab film \n", "2 instance_of-label-0e46af instance_of \n", "3 l_hamilton-label-2b3667 l_hamilton \n", "4 t15-location-303f2a t15 \n", "5 t17-location-295099 t17 \n", "6 t4 terminator2_jd \n", "7 t4-role-aa802f t4 \n", "8 t6 terminator2_jd \n", "9 t6-role-a29a51 t6 \n", "10 t8 terminator2_jd \n", "11 t8-point_in_time-370fac t8 \n", "12 t8-winner-211455 t8 \n", "13 t8-winner-dc3cda t8 \n", "14 terminator2_jd-duration-79d04d terminator2_jd \n", "15 terminator2_jd-genre-2e6128 terminator2_jd \n", "16 terminator2_jd-genre-bd938c terminator2_jd \n", "17 terminator2_jd-instance_of-d0607f terminator2_jd \n", "18 terminator2_jd-label-01de63 terminator2_jd \n", "19 terminator2_jd-publication_date-6aeb53 terminator2_jd \n", "20 terminator2_jd-publication_date-e29331 terminator2_jd \n", "\n", " label node2 \n", "0 label \"Arnold Schwarzenegger\"@en \n", "1 subclass_of visual_artwork \n", "2 label \"instance of\"@en \n", "3 label \"Linda Hamilton\"@en \n", "4 location united_states \n", "5 location sweden \n", "6 cast a_schwarzenegger \n", "7 role terminator \n", "8 cast l_hamilton \n", "9 role s_connor \n", "10 award academy-best-sound-editing \n", "11 point_in_time ^1992-03-30T00:00:00Z/11 \n", "12 winner g_borders \n", "13 winner g_rydstrom \n", "14 duration 108minute \n", "15 genre science_fiction \n", "16 genre action \n", "17 instance_of film \n", "18 label \"Terminator 2\"@en \n", "19 publication_date ^1985-02-08T00:00:00Z/11 \n", "20 publication_date ^1984-10-26T00:00:00Z/11 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk sort -i \"$TEMP\"/movies.ids.tsv\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is nice to be able to see the labels of the nodes. 
We can use the `lift` command to lift the labels from rows to columns (it is possible to lift other relations too):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 label \\\n", "0 terminator2_jd-instance_of-d0607f terminator2_jd instance_of \n", "1 terminator2_jd-genre-2e6128 terminator2_jd genre \n", "2 terminator2_jd-genre-bd938c terminator2_jd genre \n", "3 t4 terminator2_jd cast \n", "4 t4-role-aa802f t4 role \n", "5 t6 terminator2_jd cast \n", "6 t6-role-a29a51 t6 role \n", "7 t8 terminator2_jd award \n", "8 t8-point_in_time-370fac t8 point_in_time \n", "9 t8-winner-dc3cda t8 winner \n", "10 t8-winner-211455 t8 winner \n", "11 film-subclass_of-f126ab film subclass_of \n", "12 terminator2_jd-publication_date-e29331 terminator2_jd publication_date \n", "13 t15-location-303f2a t15 location \n", "14 terminator2_jd-publication_date-6aeb53 terminator2_jd publication_date \n", "15 t17-location-295099 t17 location \n", "16 terminator2_jd-duration-79d04d terminator2_jd duration \n", "\n", " node2 node1;label label;label \\\n", "0 film \"Terminator 2\"@en \"instance of\"@en \n", "1 science_fiction \"Terminator 2\"@en \n", "2 action \"Terminator 2\"@en \n", "3 a_schwarzenegger \"Terminator 2\"@en \n", "4 terminator \n", "5 l_hamilton \"Terminator 2\"@en \n", "6 s_connor \n", "7 academy-best-sound-editing \"Terminator 2\"@en \n", "8 ^1992-03-30T00:00:00Z/11 \n", "9 g_rydstrom \n", "10 g_borders \n", "11 visual_artwork \n", "12 ^1984-10-26T00:00:00Z/11 \"Terminator 2\"@en \n", "13 united_states \n", "14 ^1985-02-08T00:00:00Z/11 \"Terminator 2\"@en \n", "15 sweden \n", "16 108minute \"Terminator 2\"@en \n", "\n", " node2;label \n", "0 \n", "1 \n", "2 \n", "3 \"Arnold Schwarzenegger\"@en \n", "4 \n", "5 \"Linda Hamilton\"@en \n", "6 \n", "7 \n", "8 \n", "9 \n", "10 \n", "11 \n", "12 \n", "13 \n", "14 \n", "15 \n", "16 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk lift -i \"$TEMP\"/movies.ids.tsv \n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The KGTK equivalent of grep:" ] }, { "cell_type": 
"code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 label node2\n", "0 terminator2_jd-genre-2e6128 terminator2_jd genre science_fiction\n", "1 terminator2_jd-genre-bd938c terminator2_jd genre action\n", "2 t4 terminator2_jd cast a_schwarzenegger\n", "3 t6 terminator2_jd cast l_hamilton" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\"\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filter also supports regular expressioins. Here are the edges that have `mi` somewhere and end with `@en`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 label node2\n", "0 terminator2_jd-label-01de63 terminator2_jd label \"Terminator 2\"@en\n", "1 l_hamilton-label-2b3667 l_hamilton label \"Linda Hamilton\"@en" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";;mi.*@en\" --regex --match-type search\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `md` command makes it easy to convert the output to markdown:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| id | node1 | label | node2 |\n", "| -- | -- | -- | -- |\n", "| terminator2_jd-genre-2e6128 | terminator2_jd | genre | science_fiction |\n", "| terminator2_jd-genre-bd938c | terminator2_jd | genre | action |\n", "| t4 | terminator2_jd | cast | a_schwarzenegger |\n", "| t6 | terminator2_jd | cast | l_hamilton |\n" ] } ], "source": [ "!$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\" / md " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cat` command has many output formats, so we can output CSV:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id,node1,label,node2\n", "terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction\n", "terminator2_jd-genre-bd938c,terminator2_jd,genre,action\n", "t4,terminator2_jd,cast,a_schwarzenegger\n", "t6,terminator2_jd,cast,l_hamilton\n" ] } ], "source": [ "!$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\" / cat --output-format csv " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can also output JSON (and several other formats):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\n", 
"{\"id\":\"terminator2_jd-genre-2e6128\",\"node1\":\"terminator2_jd\",\"label\":\"genre\",\"node2\":\"science_fiction\"},\n", "{\"id\":\"terminator2_jd-genre-bd938c\",\"node1\":\"terminator2_jd\",\"label\":\"genre\",\"node2\":\"action\"},\n", "{\"id\":\"t4\",\"node1\":\"terminator2_jd\",\"label\":\"cast\",\"node2\":\"a_schwarzenegger\"},\n", "{\"id\":\"t6\",\"node1\":\"terminator2_jd\",\"label\":\"cast\",\"node2\":\"l_hamilton\"}\n", "]\n" ] } ], "source": [ "!$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\" / cat --output-format json-map " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove the `id` and `label` columns" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 node2\n", "0 terminator2_jd science_fiction\n", "1 terminator2_jd action\n", "2 terminator2_jd a_schwarzenegger\n", "3 terminator2_jd l_hamilton" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\" \\\n", "/ remove-columns -c id label\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In one go remove the columns we don't want and then rename them to good names:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " movie_id title\n", "0 terminator2_jd science_fiction\n", "1 terminator2_jd action\n", "2 terminator2_jd a_schwarzenegger\n", "3 terminator2_jd l_hamilton" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk filter -i \"$TEMP\"/movies.ids.tsv -p \";cast,genre;\" \\\n", "/ remove-columns -c id label \\\n", "/ rename-columns --mode NONE --output-columns movie_id title \n", "\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Count the number of distinct values in column `label`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2\n", "0 award count 1\n", "1 duration count 1\n", "2 instance_of count 1\n", "3 point_in_time count 1\n", "4 subclass_of count 1\n", "5 cast count 2\n", "6 genre count 2\n", "7 location count 2\n", "8 publication_date count 2\n", "9 role count 2\n", "10 winner count 2\n", "11 label count 4" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk unique -i \"$TEMP\"/movies.ids.tsv --column label / sort -c node2\n", "\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Expand the structured literals into columns with the consittuents to make it easy for developers to parse the structured literals:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
idnode1labelnode2node2;kgtk:data_typenode2;kgtk:validnode2;kgtk:list_lennode2;kgtk:numbernode2;kgtk:low_tolerancenode2;kgtk:high_tolerance...node2;kgtk:units_nodenode2;kgtk:textnode2;kgtk:languagenode2;kgtk:language_suffixnode2;kgtk:latitudenode2;kgtk:longitudenode2;kgtk:date_and_timenode2;kgtk:precisionnode2;kgtk:truthnode2;kgtk:symbol
0terminator2_jd-label-01de63terminator2_jdlabel\"Terminator 2\"@en...
1terminator2_jd-instance_of-d0607fterminator2_jdinstance_offilmsymbolTrue0...film
2terminator2_jd-genre-2e6128terminator2_jdgenrescience_fictionsymbolTrue0...science_fiction
3terminator2_jd-genre-bd938cterminator2_jdgenreactionsymbolTrue0...action
4t4terminator2_jdcasta_schwarzeneggersymbolTrue0...a_schwarzenegger
5t4-role-aa802ft4roleterminatorsymbolTrue0...terminator
6t6terminator2_jdcastl_hamiltonsymbolTrue0...l_hamilton
7t6-role-a29a51t6roles_connorsymbolTrue0...s_connor
8t8terminator2_jdawardacademy-best-sound-editingsymbolTrue0...academy-best-sound-editing
9t8-point_in_time-370fact8point_in_time^1992-03-30T00:00:00Z/11date_and_timesTrue0...\"1992-03-30T00:00:00Z\"11
10t8-winner-dc3cdat8winnerg_rydstromsymbolTrue0...g_rydstrom
11t8-winner-211455t8winnerg_borderssymbolTrue0...g_borders
12l_hamilton-label-2b3667l_hamiltonlabel\"Linda Hamilton\"@en...
13a_schwarzenegger-label-2a4c28a_schwarzeneggerlabel\"Arnold Schwarzenegger\"@en...
14film-subclass_of-f126abfilmsubclass_ofvisual_artworksymbolTrue0...visual_artwork
15terminator2_jd-publication_date-e29331terminator2_jdpublication_date^1984-10-26T00:00:00Z/11date_and_timesTrue0...\"1984-10-26T00:00:00Z\"11
16t15-location-303f2at15locationunited_statessymbolTrue0...united_states
17terminator2_jd-publication_date-6aeb53terminator2_jdpublication_date^1985-02-08T00:00:00Z/11date_and_timesTrue0...\"1985-02-08T00:00:00Z\"11
18t17-location-295099t17locationswedensymbolTrue0...sweden
19terminator2_jd-duration-79d04dterminator2_jdduration108minute...
20instance_of-label-0e46afinstance_oflabel\"instance of\"@en...
\n", "

21 rows × 21 columns

\n", "
" ], "text/plain": [ " id node1 \\\n", "0 terminator2_jd-label-01de63 terminator2_jd \n", "1 terminator2_jd-instance_of-d0607f terminator2_jd \n", "2 terminator2_jd-genre-2e6128 terminator2_jd \n", "3 terminator2_jd-genre-bd938c terminator2_jd \n", "4 t4 terminator2_jd \n", "5 t4-role-aa802f t4 \n", "6 t6 terminator2_jd \n", "7 t6-role-a29a51 t6 \n", "8 t8 terminator2_jd \n", "9 t8-point_in_time-370fac t8 \n", "10 t8-winner-dc3cda t8 \n", "11 t8-winner-211455 t8 \n", "12 l_hamilton-label-2b3667 l_hamilton \n", "13 a_schwarzenegger-label-2a4c28 a_schwarzenegger \n", "14 film-subclass_of-f126ab film \n", "15 terminator2_jd-publication_date-e29331 terminator2_jd \n", "16 t15-location-303f2a t15 \n", "17 terminator2_jd-publication_date-6aeb53 terminator2_jd \n", "18 t17-location-295099 t17 \n", "19 terminator2_jd-duration-79d04d terminator2_jd \n", "20 instance_of-label-0e46af instance_of \n", "\n", " label node2 node2;kgtk:data_type \\\n", "0 label \"Terminator 2\"@en \n", "1 instance_of film symbol \n", "2 genre science_fiction symbol \n", "3 genre action symbol \n", "4 cast a_schwarzenegger symbol \n", "5 role terminator symbol \n", "6 cast l_hamilton symbol \n", "7 role s_connor symbol \n", "8 award academy-best-sound-editing symbol \n", "9 point_in_time ^1992-03-30T00:00:00Z/11 date_and_times \n", "10 winner g_rydstrom symbol \n", "11 winner g_borders symbol \n", "12 label \"Linda Hamilton\"@en \n", "13 label \"Arnold Schwarzenegger\"@en \n", "14 subclass_of visual_artwork symbol \n", "15 publication_date ^1984-10-26T00:00:00Z/11 date_and_times \n", "16 location united_states symbol \n", "17 publication_date ^1985-02-08T00:00:00Z/11 date_and_times \n", "18 location sweden symbol \n", "19 duration 108minute \n", "20 label \"instance of\"@en \n", "\n", " node2;kgtk:valid node2;kgtk:list_len node2;kgtk:number \\\n", "0 \n", "1 True 0 \n", "2 True 0 \n", "3 True 0 \n", "4 True 0 \n", "5 True 0 \n", "6 True 0 \n", "7 True 0 \n", "8 True 0 \n", "9 True 0 \n", "10 True 0 
\n", "11 True 0 \n", "12 \n", "13 \n", "14 True 0 \n", "15 True 0 \n", "16 True 0 \n", "17 True 0 \n", "18 True 0 \n", "19 \n", "20 \n", "\n", " node2;kgtk:low_tolerance node2;kgtk:high_tolerance ... \\\n", "0 ... \n", "1 ... \n", "2 ... \n", "3 ... \n", "4 ... \n", "5 ... \n", "6 ... \n", "7 ... \n", "8 ... \n", "9 ... \n", "10 ... \n", "11 ... \n", "12 ... \n", "13 ... \n", "14 ... \n", "15 ... \n", "16 ... \n", "17 ... \n", "18 ... \n", "19 ... \n", "20 ... \n", "\n", " node2;kgtk:units_node node2;kgtk:text node2;kgtk:language \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 \n", "6 \n", "7 \n", "8 \n", "9 \n", "10 \n", "11 \n", "12 \n", "13 \n", "14 \n", "15 \n", "16 \n", "17 \n", "18 \n", "19 \n", "20 \n", "\n", " node2;kgtk:language_suffix node2;kgtk:latitude node2;kgtk:longitude \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 \n", "6 \n", "7 \n", "8 \n", "9 \n", "10 \n", "11 \n", "12 \n", "13 \n", "14 \n", "15 \n", "16 \n", "17 \n", "18 \n", "19 \n", "20 \n", "\n", " node2;kgtk:date_and_time node2;kgtk:precision node2;kgtk:truth \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 \n", "6 \n", "7 \n", "8 \n", "9 \"1992-03-30T00:00:00Z\" 11 \n", "10 \n", "11 \n", "12 \n", "13 \n", "14 \n", "15 \"1984-10-26T00:00:00Z\" 11 \n", "16 \n", "17 \"1985-02-08T00:00:00Z\" 11 \n", "18 \n", "19 \n", "20 \n", "\n", " node2;kgtk:symbol \n", "0 \n", "1 film \n", "2 science_fiction \n", "3 action \n", "4 a_schwarzenegger \n", "5 terminator \n", "6 l_hamilton \n", "7 s_connor \n", "8 academy-best-sound-editing \n", "9 \n", "10 g_rydstrom \n", "11 g_borders \n", "12 \n", "13 \n", "14 visual_artwork \n", "15 \n", "16 united_states \n", "17 \n", "18 sweden \n", "19 \n", "20 \n", "\n", "[21 rows x 21 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lines = !$kgtk explode -i \"$TEMP\"/movies.ids.tsv\n", "kgtk_to_dataframe(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Wikidata in KGTK\n", "KGTK has the 
ability to import a Wikidata JSON dump and convert it to the KGTK representation to make it easy to process the full Wikidata KG on a laptop. There are 86 files which include all the information available in the Wikidata dump and files containing commonly used information derived from the dump. We partitioned the files because in most use cases you only need to use a subset of the files.\n", "\n", "The files are very large. `claims.tsv` (23GB compressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 pedroszekely staff 32M Jan 24 00:32 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 1.7G Jan 24 00:30 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 122M Jan 24 00:33 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 167M Jan 24 00:35 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz\n", "-rw-r--r-- 1 pedroszekely staff 264M Jan 24 00:32 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/qualifiers.tsv.gz\n" ] } ], "source": [ "!ls -lh \"$CLAIMS\" \"$QUALIFIERS\" \"$LABEL\" \"$ALIAS\" \"$DESCRIPTION\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`claims.tsv` contains many edges:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 94123796 587802328 7562639743\n" ] } ], "source": [ "!zcat < \"$CLAIMS\" | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# KGTK Data Model\n", "The KGTK data model is a generalization 
of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. The order of the columns is arbitrary.\n", "\n", "All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store additional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.\n", "Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` column records the Wikidata type of the value in the `node2` column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Claims" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "zcat: error writing to output: Broken pipe\n", "id node1 label node2 node2;wikidatatype rank\n", "P10-P1628-32b85d-7927ece6-0 P10 P1628 \"http://www.w3.org/2006/vcard/ns#Video\" url normal\n", "P10-P1628-acf60d-b8950832-0 P10 P1628 \"https://schema.org/video\" url normal\n", "P10-P1629-Q34508-bcc39400-0 P10 P1629 Q34508 wikibase-item normal\n", "P10-P1659-P1651-c4068028-0 P10 P1659 P1651 wikibase-property normal\n", "P10-P1659-P18-5e4b9c4f-0 P10 P1659 P18 wikibase-property normal\n", "P10-P1659-P4238-d21d1ac0-0 P10 P1659 P4238 wikibase-property normal\n", "P10-P1659-P51-86aca4c5-0 P10 P1659 P51 wikibase-property normal\n", "P10-P1855-Q7378-555592a4-0 P10 P1855 Q7378 wikibase-item normal\n", "P10-P2302-Q21502404-d012aef4-0 P10 P2302 Q21502404 wikibase-item normal\n" ] } ], "source": [ "!zcat < \"$CLAIMS\" | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wikidata uses numbers to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements. The format of the id can be arbitrary, but we assigned ids so that sorting a file by id places all edges about a subject consecutively." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[90mid\u001b[39m P10\n", "\u001b[42mLabel\u001b[49m video\n", "\u001b[44mDescription\u001b[49m relevant video. For images, use the property P18. 
For film trailers, qualify with \"object has role\" (P3831)=\"trailer\" (Q622550)\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mWikidata property to link to Commons \u001b[90m(Q18610173)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m P1628\n", "\u001b[42mLabel\u001b[49m equivalent property\n", "\u001b[44mDescription\u001b[49m equivalent property in other ontologies (use in statements on properties, use property URI)\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mWikidata metaproperty for ontology mapping \u001b[90m(Q42842547)\u001b[39m\n", "\n", "\u001b[90mid\u001b[39m P1629\n", "\u001b[42mLabel\u001b[49m subject item of this property\n", "\u001b[44mDescription\u001b[49m relationship represented by the property\n", "\u001b[30m\u001b[47minstance of\u001b[49m\u001b[39m \u001b[90m(P31)\u001b[39m\u001b[90m: \u001b[39mWikidata property for property documentation \u001b[90m(Q19820110)\u001b[39m\n" ] } ], "source": [ "!wd u P10 P1628 P1629" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use a KGTK query to fetch edges about Belgium. `$kypher` is a shortcut for the `kgtk query` command that also passes in the location of the SQLite database we use to store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with label `P1081` (human development index).\n", "\n", " **Note:** We are using the `--as` option in `kgtk query` to set an alias for the `$CLAIMS` file. This alias can be used in subsequent `kgtk query` commands." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnode1labelnode2node2;wikidatatyperank
0Q31-P1036-c4e1ad-df86eeb8-0Q31P1036\"2--493\"external-idnormal
1Q31-P1081-02c2ed-033524b0-0Q31P1081+0.866quantitynormal
2Q31-P1081-02c2ed-7971505b-0Q31P1081+0.866quantitynormal
3Q31-P1081-068470-c1c63b8d-0Q31P1081+0.889quantitynormal
4Q31-P1081-068470-ddac01e0-0Q31P1081+0.889quantitynormal
5Q31-P1081-144738-c1851cdc-0Q31P1081+0.905quantitynormal
6Q31-P1081-175742-c07ac1c8-0Q31P1081+0.888quantitynormal
7Q31-P1081-19636d-c08dd8a8-0Q31P1081+0.896quantitynormal
8Q31-P1081-1efc03-433a7a4d-0Q31P1081+0.913quantitynormal
9Q31-P1081-1f8602-ddac530d-0Q31P1081+0.852quantitynormal
\n", "
" ], "text/plain": [ " id node1 label node2 node2;wikidatatype \\\n", "0 Q31-P1036-c4e1ad-df86eeb8-0 Q31 P1036 \"2--493\" external-id \n", "1 Q31-P1081-02c2ed-033524b0-0 Q31 P1081 +0.866 quantity \n", "2 Q31-P1081-02c2ed-7971505b-0 Q31 P1081 +0.866 quantity \n", "3 Q31-P1081-068470-c1c63b8d-0 Q31 P1081 +0.889 quantity \n", "4 Q31-P1081-068470-ddac01e0-0 Q31 P1081 +0.889 quantity \n", "5 Q31-P1081-144738-c1851cdc-0 Q31 P1081 +0.905 quantity \n", "6 Q31-P1081-175742-c07ac1c8-0 Q31 P1081 +0.888 quantity \n", "7 Q31-P1081-19636d-c08dd8a8-0 Q31 P1081 +0.896 quantity \n", "8 Q31-P1081-1efc03-433a7a4d-0 Q31 P1081 +0.913 quantity \n", "9 Q31-P1081-1f8602-ddac530d-0 Q31 P1081 +0.852 quantity \n", "\n", " rank \n", "0 normal \n", "1 normal \n", "2 normal \n", "3 normal \n", "4 normal \n", "5 normal \n", "6 normal \n", "7 normal \n", "8 normal \n", "9 normal " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i \"$CLAIMS\" --as \"claims\" \\\n", "--match '(:Q31)-[]->()' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id node1 label node2 node2;wikidatatype rank\n", "P10-label-en P10 label 'video'@en\n", "P1000-label-en P1000 label 'record held'@en\n", "P1001-label-en P1001 label 'applies to jurisdiction'@en\n", "P1002-label-en P1002 label 'engine configuration'@en\n", "P1003-label-en P1003 label 'National Library of Romania ID'@en\n", "P1004-label-en P1004 label 'MusicBrainz place ID'@en\n", "P1005-label-en P1005 label 'Portuguese National Library ID'@en\n", "P1006-label-en P1006 label 'Nationale Thesaurus voor Auteurs ID'@en\n", "P1007-label-en P1007 label 'Lattes Platform number'@en\n", "zcat: error writing to output: Broken pipe\n" ] } ], "source": [ "!zcat < \"$LABEL\" | head | column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KGTK accepts multiple files as input, and can do a join to retrieve the label for each property. When using multiple files, it is necessary to tag each clause with the file that provides the data for the clause. For example, the first clause is tagged with `claim` as the word `claim` is part of the file name. The variable `property` is used to connect the two clauses.\n", "\n", "**Note:** We use the alias `claims` defined in a previous cell and introduce a new alias for the `$LABEL` file." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnode1labelnode2label;label
0Q31-P1036-c4e1ad-df86eeb8-0Q31P1036\"2--493\"'Dewey Decimal Classification'@en
1Q31-P1081-02c2ed-033524b0-0Q31P1081+0.866'Human Development Index'@en
2Q31-P1081-02c2ed-7971505b-0Q31P1081+0.866'Human Development Index'@en
3Q31-P1081-068470-c1c63b8d-0Q31P1081+0.889'Human Development Index'@en
4Q31-P1081-068470-ddac01e0-0Q31P1081+0.889'Human Development Index'@en
5Q31-P1081-144738-c1851cdc-0Q31P1081+0.905'Human Development Index'@en
6Q31-P1081-175742-c07ac1c8-0Q31P1081+0.888'Human Development Index'@en
7Q31-P1081-19636d-c08dd8a8-0Q31P1081+0.896'Human Development Index'@en
8Q31-P1081-1efc03-433a7a4d-0Q31P1081+0.913'Human Development Index'@en
9Q31-P1081-1f8602-ddac530d-0Q31P1081+0.852'Human Development Index'@en
\n", "
" ], "text/plain": [ " id node1 label node2 \\\n", "0 Q31-P1036-c4e1ad-df86eeb8-0 Q31 P1036 \"2--493\" \n", "1 Q31-P1081-02c2ed-033524b0-0 Q31 P1081 +0.866 \n", "2 Q31-P1081-02c2ed-7971505b-0 Q31 P1081 +0.866 \n", "3 Q31-P1081-068470-c1c63b8d-0 Q31 P1081 +0.889 \n", "4 Q31-P1081-068470-ddac01e0-0 Q31 P1081 +0.889 \n", "5 Q31-P1081-144738-c1851cdc-0 Q31 P1081 +0.905 \n", "6 Q31-P1081-175742-c07ac1c8-0 Q31 P1081 +0.888 \n", "7 Q31-P1081-19636d-c08dd8a8-0 Q31 P1081 +0.896 \n", "8 Q31-P1081-1efc03-433a7a4d-0 Q31 P1081 +0.913 \n", "9 Q31-P1081-1f8602-ddac530d-0 Q31 P1081 +0.852 \n", "\n", " label;label \n", "0 'Dewey Decimal Classification'@en \n", "1 'Human Development Index'@en \n", "2 'Human Development Index'@en \n", "3 'Human Development Index'@en \n", "4 'Human Development Index'@en \n", "5 'Human Development Index'@en \n", "6 'Human Development Index'@en \n", "7 'Human Development Index'@en \n", "8 'Human Development Index'@en \n", "9 'Human Development Index'@en " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i claims -i \"$LABEL\" --as \"labels\" \\\n", "--match 'claim: (n1:Q31)-[l {label: property}]->(n2), label: (property)-[:label]->(property_label)' \\\n", "--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get all the distinct properties defined for Belgium" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labellabel;label
0P1036'Dewey Decimal Classification'@en
1P1081'Human Development Index'@en
2P1082'population'@en
3P1151'topic\\\\'s main Wikimedia portal'@en
4P1198'unemployment rate'@en
.........
205P949'National Library of Israel ID'@en
206P982'MusicBrainz area ID'@en
207P984'IOC country code'@en
208P989'spoken text audio'@en
209P998'DMOZ ID'@en
\n", "

210 rows × 2 columns

\n", "
" ], "text/plain": [ " label label;label\n", "0 P1036 'Dewey Decimal Classification'@en\n", "1 P1081 'Human Development Index'@en\n", "2 P1082 'population'@en\n", "3 P1151 'topic\\\\'s main Wikimedia portal'@en\n", "4 P1198 'unemployment rate'@en\n", ".. ... ...\n", "205 P949 'National Library of Israel ID'@en\n", "206 P982 'MusicBrainz area ID'@en\n", "207 P984 'IOC country code'@en\n", "208 P989 'spoken text audio'@en\n", "209 P998 'DMOZ ID'@en\n", "\n", "[210 rows x 2 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i claims -i \"$LABEL\" --as \"labels\" \\\n", "--match 'claim: (n1:Q31)-[l {label: property}]->(n2), label: (property)-[:label]->(property_label)' \\\n", "--return 'distinct property as label, property_label as `label;label`' \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the classes that Belgium is an instance of, recorded in property `P31`" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnode1labelnode2node2;label
0Q31-P31-Q1250464-7c4e239d-0Q31P31Q1250464'realm'@en
1Q31-P31-Q185441-58d7de2e-0Q31P31Q185441'member state of the European Union'@en
2Q31-P31-Q20181813-8e41ab67-0Q31P31Q20181813'colonial power'@en
3Q31-P31-Q3624078-a1d9d1a3-0Q31P31Q3624078'sovereign state'@en
4Q31-P31-Q43702-0dce2031-0Q31P31Q43702'federation'@en
5Q31-P31-Q6256-3422ad69-0Q31P31Q6256'country'@en
\n", "
" ], "text/plain": [ " id node1 label node2 \\\n", "0 Q31-P31-Q1250464-7c4e239d-0 Q31 P31 Q1250464 \n", "1 Q31-P31-Q185441-58d7de2e-0 Q31 P31 Q185441 \n", "2 Q31-P31-Q20181813-8e41ab67-0 Q31 P31 Q20181813 \n", "3 Q31-P31-Q3624078-a1d9d1a3-0 Q31 P31 Q3624078 \n", "4 Q31-P31-Q43702-0dce2031-0 Q31 P31 Q43702 \n", "5 Q31-P31-Q6256-3422ad69-0 Q31 P31 Q6256 \n", "\n", " node2;label \n", "0 'realm'@en \n", "1 'member state of the European Union'@en \n", "2 'colonial power'@en \n", "3 'sovereign state'@en \n", "4 'federation'@en \n", "5 'country'@en " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i claims -i labels \\\n", "--match 'claims: (n1:Q31)-[l:P31]->(n2), labels: (n2)-[:label]->(n2_label)' \\\n", "--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \\\n", "--limit 10 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get all the values for population" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnode1labelnode2
0Q31-P1082-03700d-e9540ac9-0Q31P1082+10136811
1Q31-P1082-04bed1-dfb79a97-0Q31P1082+9772419
2Q31-P1082-09cf36-da068a8a-0Q31P1082+9153489
3Q31-P1082-0d8ab5-e1fa3416-0Q31P1082+9858308
4Q31-P1082-10985f-021cd5f9-0Q31P1082+9618756
...............
65Q31-P1082-ee304f-78930d38-0Q31P1082+9830358
66Q31-P1082-f304d4-5b5295bb-0Q31P1082+9859242
67Q31-P1082-f90107-aedcfbe5-0Q31P1082+10445852
68Q31-P1082-fa9783-4e530113-0Q31P1082+10203008
69Q31-P1082-fb1f82-f3860fe1-0Q31P1082+9646032
\n", "

70 rows × 4 columns

\n", "
" ], "text/plain": [ " id node1 label node2\n", "0 Q31-P1082-03700d-e9540ac9-0 Q31 P1082 +10136811\n", "1 Q31-P1082-04bed1-dfb79a97-0 Q31 P1082 +9772419\n", "2 Q31-P1082-09cf36-da068a8a-0 Q31 P1082 +9153489\n", "3 Q31-P1082-0d8ab5-e1fa3416-0 Q31 P1082 +9858308\n", "4 Q31-P1082-10985f-021cd5f9-0 Q31 P1082 +9618756\n", ".. ... ... ... ...\n", "65 Q31-P1082-ee304f-78930d38-0 Q31 P1082 +9830358\n", "66 Q31-P1082-f304d4-5b5295bb-0 Q31 P1082 +9859242\n", "67 Q31-P1082-f90107-aedcfbe5-0 Q31 P1082 +10445852\n", "68 Q31-P1082-fa9783-4e530113-0 Q31 P1082 +10203008\n", "69 Q31-P1082-fb1f82-f3860fe1-0 Q31 P1082 +9646032\n", "\n", "[70 rows x 4 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i claims -i labels \\\n", "--match 'claims: (n1:Q31)-[l:P1082]->(n2)' \\\n", "--return 'l as id, n1 as node1, l.label as label, n2 as node2' \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Qualifiers\n", "Qualifiers provide additional information about the claims stated in the edges. For `P1082` the qualifiers tell us the year when the population was measured. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first population value. To do so, we use the identifier of the edge (`Q31-P1082-03700d-e9540ac9-0`) as `node1` in the `qualifiers.tsv` file. We get one edge, so we know that the population in `1995` was `10136811`. Note that the qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnode1labelnode2node2;wikidatatyperank
0Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0Q31-P1082-03700d-e9540ac9-0P585^1995-00-00T00:00:00Z/9time
\n", "
" ], "text/plain": [ " id node1 \\\n", "0 Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0 Q31-P1082-03700d-e9540ac9-0 \n", "\n", " label node2 node2;wikidatatype rank \n", "0 P585 ^1995-00-00T00:00:00Z/9 time " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i \"$QUALIFIERS\" --as \"qualifiers\" \\\n", "--match '(n1:`Q31-P1082-03700d-e9540ac9-0`)-[l]->(n2)' \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make the qualifier edge more readable by retrieving the label of the property: the following query combines the patterns of the previous two queries to retrieve the label of the property. The query omits the identifier of the qualifier edges to save space. Also, the header of the additional column can be arbitrary, i.e., you can name it whatever you want; the name used follows a KGTK convention that enables KGTK to automatically parse the output, which is useful if we want to use the output as an input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label` as the column contains the label of the entity." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1 label node2 label;label\n", "Q31-P1082-03700d-e9540ac9-0 P585 ^1995-00-00T00:00:00Z/9 'point in time'@en\n" ] } ], "source": [ "!$kypher -i qualifiers -i labels \\\n", "--match 'qual: (n1:`Q31-P1082-03700d-e9540ac9-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \\\n", "--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \\\n", "--limit 10 \\\n", "| column -t -s $'\\t'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put all the values of `P1082` in a file, which we will conveniently name `Q31.P1082.tsv`" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "!$kypher -i claims \\\n", "--match '(n1:Q31)-[l:P1082]->(n2)' \\\n", "--return 'l as id, n1 as node1, l.label as label, n2 as node2' \\\n", "-o \"$TEMP\"/Q31.P1082.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to combine the `P1082` edges of Belgium with the qualifiers. To do this we will run a query that uses the edges that we stored in `Q31.P1082.tsv`, and retrieve the qualifiers for each of those edges; the result of our query will be the qualifier edges of the `P1082` (population) edges. To union the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the output to the `sort2` command to sort the edges. The combined edges are shown below. We see each claim edge followed by the qualifiers defined for it.\n", "\n", "This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id node1 \\\n", "0 Q31-P1082-03700d-e9540ac9-0 Q31 \n", "1 Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0 Q31-P1082-03700d-e9540ac9-0 \n", "2 Q31-P1082-04bed1-dfb79a97-0 Q31 \n", "3 Q31-P1082-04bed1-dfb79a97-0-P585-271261-0 Q31-P1082-04bed1-dfb79a97-0 \n", "4 Q31-P1082-09cf36-da068a8a-0 Q31 \n", ".. ... ... \n", "135 Q31-P1082-f90107-aedcfbe5-0-P585-cab8cf-0 Q31-P1082-f90107-aedcfbe5-0 \n", "136 Q31-P1082-fa9783-4e530113-0 Q31 \n", "137 Q31-P1082-fa9783-4e530113-0-P585-12d4de-0 Q31-P1082-fa9783-4e530113-0 \n", "138 Q31-P1082-fb1f82-f3860fe1-0 Q31 \n", "139 Q31-P1082-fb1f82-f3860fe1-0-P585-87910b-0 Q31-P1082-fb1f82-f3860fe1-0 \n", "\n", " label node2 \n", "0 P1082 +10136811 \n", "1 P585 ^1995-00-00T00:00:00Z/9 \n", "2 P1082 +9772419 \n", "3 P585 ^1974-00-00T00:00:00Z/9 \n", "4 P1082 +9153489 \n", ".. ... ... \n", "135 P585 ^2005-01-01T00:00:00Z/11 \n", "136 P1082 +10203008 \n", "137 P585 ^1998-00-00T00:00:00Z/9 \n", "138 P1082 +9646032 \n", "139 P585 ^1969-00-00T00:00:00Z/9 \n", "\n", "[140 rows x 4 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = !$kypher -i qualifiers -i \"$TEMP\"/Q31.P1082.tsv \\\n", "--match 'P1082: ()-[l]->(), qual: (l)-[lq]->(n2)' \\\n", "--return 'lq as id, l as node1, lq.label as label, n2 as node2' \\\n", "/ cat -i - -i \"$TEMP\"/Q31.P1082.tsv \\\n", "/ sort2 \n", "\n", "kgtk_to_dataframe(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "- KGTK represents graphs in TSV files with the standard columns `id`, `node1`, `label` and `node2`\n", "- KGTK files can include arbitrary additional columns\n", "- The identifier of an edge can be used as a node in another edge, which makes it possible to represent edges about edges\n", "- KGTK provides a powerful query command based on Cypher, as well as a host of other commands; type `kgtk --help` to see the full list"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kgtk-env", "language": "python", "name": "kgtk-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }