# KGTK Tutorial: Introduction

We begin the tutorial with a quick overview of some of the commands in KGTK. Then we turn our attention to working with Wikidata.

## Tutorial Setup

Import utility functions and define environment variables for the folders and files that we will use:

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz"
ALL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/all.tsv.gz"
CLAIMS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz"
DESCRIPTION: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/pedroszekely/Documents/GitHub/kgtk/examples"
GE: "/Users/pedroszekely/Downloads/kgtk-tutorial/temp/graph-embedding"
ISA: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.isa.tsv.gz"
ITEM: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.wikibase-item.tsv.gz"
KGTK_PATH: "/Users/pedroszekely/Documents/GitHub/kgtk"
LABEL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz"
OUT: "/Users/pedroszekely/Downloads/kgtk-tutorial/output"
P279: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.P279.tsv.gz"
P279STAR: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/de

In [2]:
!mkdir -p {output_path}
%cd {output_path}

/Users/pedroszekely/Downloads/kgtk-tutorial


In [3]:
!mkdir -p {output_folder}
!mkdir -p {temp_folder}

In [4]:
!mkdir -p "$GE"
!mkdir -p "$TE"

## Quick Tour Of KGTK Commands

Our sample input file:

In [5]:
lines = !$kgtk cat -i "$KGTK_PATH"/tutorial/datasets/movies.tsv
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2
0,,terminator2_jd,label,"""Terminator 2""@en"
1,,terminator2_jd,instance_of,film
2,,terminator2_jd,genre,science_fiction
3,,terminator2_jd,genre,action
4,t4,terminator2_jd,cast,a_schwarzenegger
5,,t4,role,terminator
6,t6,terminator2_jd,cast,l_hamilton
7,,t6,role,s_connor
8,t8,terminator2_jd,award,academy-best-sound-editing
9,,t8,point_in_time,^1992-03-30T00:00:00Z/11


Many edges are missing ids, so let's add them. We are adding wikidata-style ids, but many other styles are available:

In [6]:
lines = !$kgtk add-id --id-style wikidata -i "$KGTK_PATH"/tutorial/datasets/movies.tsv
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2
0,terminator2_jd-label-01de63,terminator2_jd,label,"""Terminator 2""@en"
1,terminator2_jd-instance_of-d0607f,terminator2_jd,instance_of,film
2,terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction
3,terminator2_jd-genre-bd938c,terminator2_jd,genre,action
4,t4,terminator2_jd,cast,a_schwarzenegger
5,t4-role-aa802f,t4,role,terminator
6,t6,terminator2_jd,cast,l_hamilton
7,t6-role-a29a51,t6,role,s_connor
8,t8,terminator2_jd,award,academy-best-sound-editing
9,t8-point_in_time-370fac,t8,point_in_time,^1992-03-30T00:00:00Z/11
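
To make the shape of these ids concrete, the sketch below builds an id from `node1`, `label` and a short hash of `node2`, and leaves existing ids untouched. The truncated SHA-256 digest and the helper names `make_edge_id` and `add_ids` are assumptions made for illustration; the hash that KGTK actually computes may differ, so the suffixes are not expected to match the output above.

```python
import hashlib


def make_edge_id(node1: str, label: str, node2: str, width: int = 6) -> str:
    """Build an id of the form node1-label-<short hash of node2> (illustrative only)."""
    digest = hashlib.sha256(node2.encode("utf-8")).hexdigest()
    return f"{node1}-{label}-{digest[:width]}"


def add_ids(edges):
    """Fill in missing ids; edges that already have an id are left untouched."""
    result = []
    for edge in edges:
        edge = dict(edge)  # copy so the input is not modified
        if not edge.get("id"):
            edge["id"] = make_edge_id(edge["node1"], edge["label"], edge["node2"])
        result.append(edge)
    return result


edges = [
    {"id": "", "node1": "terminator2_jd", "label": "genre", "node2": "action"},
    {"id": "t4", "node1": "terminator2_jd", "label": "cast", "node2": "a_schwarzenegger"},
]
with_ids = add_ids(edges)
for edge in with_ids:
    print(edge["id"])
```

Because the id is derived only from the content of the edge, running the function twice on the same edge yields the same id.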


Put the new version of the movies with ids in a file:

In [7]:
!$kgtk add-id --id-style wikidata -i "$KGTK_PATH"/tutorial/datasets/movies.tsv \
-o "$TEMP"/movies.ids.tsv

Sort the file by id (there are many other ways to sort):

In [8]:
lines = !$kgtk sort -i "$TEMP"/movies.ids.tsv
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2
0,a_schwarzenegger-label-2a4c28,a_schwarzenegger,label,"""Arnold Schwarzenegger""@en"
1,film-subclass_of-f126ab,film,subclass_of,visual_artwork
2,instance_of-label-0e46af,instance_of,label,"""instance of""@en"
3,l_hamilton-label-2b3667,l_hamilton,label,"""Linda Hamilton""@en"
4,t15-location-303f2a,t15,location,united_states
5,t17-location-295099,t17,location,sweden
6,t4,terminator2_jd,cast,a_schwarzenegger
7,t4-role-aa802f,t4,role,terminator
8,t6,terminator2_jd,cast,l_hamilton
9,t6-role-a29a51,t6,role,s_connor


It is nice to be able to see the labels of the nodes. We can use the `lift` command to lift the labels from rows to columns (it is possible to lift other relations too):

In [9]:
lines = !$kgtk lift -i "$TEMP"/movies.ids.tsv 
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2,node1;label,label;label,node2;label
0,terminator2_jd-instance_of-d0607f,terminator2_jd,instance_of,film,"""Terminator 2""@en","""instance of""@en",
1,terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction,"""Terminator 2""@en",,
2,terminator2_jd-genre-bd938c,terminator2_jd,genre,action,"""Terminator 2""@en",,
3,t4,terminator2_jd,cast,a_schwarzenegger,"""Terminator 2""@en",,"""Arnold Schwarzenegger""@en"
4,t4-role-aa802f,t4,role,terminator,,,
5,t6,terminator2_jd,cast,l_hamilton,"""Terminator 2""@en",,"""Linda Hamilton""@en"
6,t6-role-a29a51,t6,role,s_connor,,,
7,t8,terminator2_jd,award,academy-best-sound-editing,"""Terminator 2""@en",,
8,t8-point_in_time-370fac,t8,point_in_time,^1992-03-30T00:00:00Z/11,,,
9,t8-winner-dc3cda,t8,winner,g_rydstrom,,,
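
The effect of lifting can be approximated in a few lines of plain Python. This is a simplified sketch, not the KGTK implementation: the function name `lift_labels` is hypothetical, and it assumes that label edges use the `label` property and that each node has at most one label.

```python
def lift_labels(edges):
    """Fold label edges into node1;label, label;label and node2;label columns."""
    # Map each node to its label, taken from the edges whose property is "label".
    labels = {e["node1"]: e["node2"] for e in edges if e["label"] == "label"}
    lifted = []
    for e in edges:
        if e["label"] == "label":
            continue  # label edges are represented by the new columns instead
        row = dict(e)
        row["node1;label"] = labels.get(e["node1"], "")
        row["label;label"] = labels.get(e["label"], "")
        row["node2;label"] = labels.get(e["node2"], "")
        lifted.append(row)
    return lifted


edges = [
    {"id": "terminator2_jd-label-01de63", "node1": "terminator2_jd",
     "label": "label", "node2": '"Terminator 2"@en'},
    {"id": "instance_of-label-0e46af", "node1": "instance_of",
     "label": "label", "node2": '"instance of"@en'},
    {"id": "terminator2_jd-instance_of-d0607f", "node1": "terminator2_jd",
     "label": "instance_of", "node2": "film"},
]
lifted = lift_labels(edges)
print(lifted)
```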


The KGTK equivalent of grep:

In [10]:
lines = !$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;"
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2
0,terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction
1,terminator2_jd-genre-bd938c,terminator2_jd,genre,action
2,t4,terminator2_jd,cast,a_schwarzenegger
3,t6,terminator2_jd,cast,l_hamilton
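
The pattern `;cast,genre;` has three fields separated by `;`, one each for `node1`, `label` and `node2`. An empty field matches anything, and commas separate alternatives. The sketch below mimics that interpretation for exact matches only; the function name `matches` is hypothetical and the real command supports more options.

```python
def matches(edge, pattern):
    """Return True if the edge satisfies a 'node1;label;node2' pattern."""
    fields = pattern.split(";")
    if len(fields) != 3:
        raise ValueError("pattern must have exactly three ';'-separated fields")
    for column, spec in zip(("node1", "label", "node2"), fields):
        allowed = {value.strip() for value in spec.split(",") if value.strip()}
        # An empty field places no constraint on the column.
        if allowed and edge[column] not in allowed:
            return False
    return True


edges = [
    {"node1": "terminator2_jd", "label": "genre", "node2": "action"},
    {"node1": "terminator2_jd", "label": "cast", "node2": "l_hamilton"},
    {"node1": "t6", "label": "role", "node2": "s_connor"},
]
selected = [e for e in edges if matches(e, ";cast,genre;")]
print(selected)
```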


Filter also supports regular expressions. Here are the edges whose `node2` has `mi` somewhere and ends with `@en`:

In [11]:
lines = !$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";;mi.*@en" --regex --match-type search
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2
0,terminator2_jd-label-01de63,terminator2_jd,label,"""Terminator 2""@en"
1,l_hamilton-label-2b3667,l_hamilton,label,"""Linda Hamilton""@en"


The `md` command makes it easy to convert the output to markdown:

In [12]:
!$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;" / md 

| id | node1 | label | node2 |
| -- | -- | -- | -- |
| terminator2_jd-genre-2e6128 | terminator2_jd | genre | science_fiction |
| terminator2_jd-genre-bd938c | terminator2_jd | genre | action |
| t4 | terminator2_jd | cast | a_schwarzenegger |
| t6 | terminator2_jd | cast | l_hamilton |


The `cat` command has many output formats, so we can output CSV:

In [13]:
!$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;" / cat --output-format csv 

id,node1,label,node2
terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction
terminator2_jd-genre-bd938c,terminator2_jd,genre,action
t4,terminator2_jd,cast,a_schwarzenegger
t6,terminator2_jd,cast,l_hamilton


It can also output JSON (and several other formats):

In [14]:
!$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;" / cat --output-format json-map 

[
{"id":"terminator2_jd-genre-2e6128","node1":"terminator2_jd","label":"genre","node2":"science_fiction"},
{"id":"terminator2_jd-genre-bd938c","node1":"terminator2_jd","label":"genre","node2":"action"},
{"id":"t4","node1":"terminator2_jd","label":"cast","node2":"a_schwarzenegger"},
{"id":"t6","node1":"terminator2_jd","label":"cast","node2":"l_hamilton"}
]


Remove the `id` and `label` columns

In [15]:
lines = !$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;"  \
/ remove-columns -c id label
kgtk_to_dataframe(lines)

Unnamed: 0,node1,node2
0,terminator2_jd,science_fiction
1,terminator2_jd,action
2,terminator2_jd,a_schwarzenegger
3,terminator2_jd,l_hamilton


In one go, remove the columns we don't want and rename the remaining ones:

In [16]:
lines = !$kgtk filter -i "$TEMP"/movies.ids.tsv -p ";cast,genre;"  \
/ remove-columns -c id label \
/ rename-columns --mode NONE --output-columns movie_id title 

kgtk_to_dataframe(lines)

Unnamed: 0,movie_id,title
0,terminator2_jd,science_fiction
1,terminator2_jd,action
2,terminator2_jd,a_schwarzenegger
3,terminator2_jd,l_hamilton


Count the number of occurrences of each distinct value in column `label`, and sort the result by count:

In [17]:
lines = !$kgtk unique -i "$TEMP"/movies.ids.tsv  --column label / sort -c node2

kgtk_to_dataframe(lines)

Unnamed: 0,node1,label,node2
0,award,count,1
1,duration,count,1
2,instance_of,count,1
3,point_in_time,count,1
4,subclass_of,count,1
5,cast,count,2
6,genre,count,2
7,location,count,2
8,publication_date,count,2
9,role,count,2
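
The same tally can be reproduced with `collections.Counter`. This is only a sketch of the idea: the function name `count_values` is hypothetical, and breaking ties alphabetically is an assumption rather than documented behavior.

```python
from collections import Counter


def count_values(edges, column):
    """Return one node1/count/node2 edge per distinct value of the given column."""
    counts = Counter(edge[column] for edge in edges)
    rows = [{"node1": value, "label": "count", "node2": n} for value, n in counts.items()]
    # Sort by count first, then by value so the order is deterministic.
    return sorted(rows, key=lambda row: (row["node2"], row["node1"]))


edges = [
    {"label": "genre"}, {"label": "genre"},
    {"label": "cast"}, {"label": "cast"},
    {"label": "award"},
]
counted = count_values(edges, "label")
for row in counted:
    print(row["node1"], row["label"], row["node2"])
```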


Expand the structured literals into columns holding their constituents, so that developers do not have to parse the literals themselves:

In [18]:
lines = !$kgtk explode -i "$TEMP"/movies.ids.tsv
kgtk_to_dataframe(lines)

Unnamed: 0,id,node1,label,node2,node2;kgtk:data_type,node2;kgtk:valid,node2;kgtk:list_len,node2;kgtk:number,node2;kgtk:low_tolerance,node2;kgtk:high_tolerance,...,node2;kgtk:units_node,node2;kgtk:text,node2;kgtk:language,node2;kgtk:language_suffix,node2;kgtk:latitude,node2;kgtk:longitude,node2;kgtk:date_and_time,node2;kgtk:precision,node2;kgtk:truth,node2;kgtk:symbol
0,terminator2_jd-label-01de63,terminator2_jd,label,"""Terminator 2""@en",,,,,,,...,,,,,,,,,,
1,terminator2_jd-instance_of-d0607f,terminator2_jd,instance_of,film,symbol,True,0.0,,,,...,,,,,,,,,,film
2,terminator2_jd-genre-2e6128,terminator2_jd,genre,science_fiction,symbol,True,0.0,,,,...,,,,,,,,,,science_fiction
3,terminator2_jd-genre-bd938c,terminator2_jd,genre,action,symbol,True,0.0,,,,...,,,,,,,,,,action
4,t4,terminator2_jd,cast,a_schwarzenegger,symbol,True,0.0,,,,...,,,,,,,,,,a_schwarzenegger
5,t4-role-aa802f,t4,role,terminator,symbol,True,0.0,,,,...,,,,,,,,,,terminator
6,t6,terminator2_jd,cast,l_hamilton,symbol,True,0.0,,,,...,,,,,,,,,,l_hamilton
7,t6-role-a29a51,t6,role,s_connor,symbol,True,0.0,,,,...,,,,,,,,,,s_connor
8,t8,terminator2_jd,award,academy-best-sound-editing,symbol,True,0.0,,,,...,,,,,,,,,,academy-best-sound-editing
9,t8-point_in_time-370fac,t8,point_in_time,^1992-03-30T00:00:00Z/11,date_and_times,True,0.0,,,,...,,,,,,,"""1992-03-30T00:00:00Z""",11.0,,
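
To give a feel for what "exploding" a literal involves, the sketch below splits two kinds of values seen in this tutorial, dates such as `^1992-03-30T00:00:00Z/11` and language-tagged strings such as `'video'@en`, and treats everything else as a symbol. The regular expressions, the function name `explode_value` and the dictionary keys are assumptions made for illustration; the KGTK parser handles many more literal types and edge cases.

```python
import re

# "^" followed by a timestamp and an optional "/precision" suffix.
DATE_RE = re.compile(r"^\^(?P<date_and_time>[^/]+)(?:/(?P<precision>\d+))?$")
# A quoted string followed by "@" and a language code.
LQ_STRING_RE = re.compile(r"^(?P<quote>['\"])(?P<text>.*)(?P=quote)@(?P<language>[A-Za-z-]+)$")


def explode_value(value):
    """Split a literal into its constituents (simplified, illustrative only)."""
    m = DATE_RE.match(value)
    if m:
        precision = m.group("precision")
        return {
            "data_type": "date_and_times",
            "date_and_time": m.group("date_and_time"),
            "precision": int(precision) if precision else None,
        }
    m = LQ_STRING_RE.match(value)
    if m:
        return {
            "data_type": "language_qualified_string",
            "text": m.group("text"),
            "language": m.group("language"),
        }
    return {"data_type": "symbol", "symbol": value}


print(explode_value("^1992-03-30T00:00:00Z/11"))
print(explode_value("'video'@en"))
print(explode_value("film"))
```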


# Wikidata in KGTK
KGTK can import a Wikidata JSON dump and convert it to the KGTK representation, making it easy to process the full Wikidata KG on a laptop. There are 86 files, which include all the information available in the Wikidata dump as well as files containing commonly used information derived from the dump. We partitioned the files because most use cases need only a subset of them.

The files are very large. `claims.tsv` (23GB compressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions.

In [19]:
!ls -lh "$CLAIMS" "$QUALIFIERS" "$LABEL" "$ALIAS" "$DESCRIPTION"

-rw-r--r--  1 pedroszekely  staff    32M Jan 24 00:32 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   1.7G Jan 24 00:30 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz
-rw-r--r--  1 pedroszekely  staff   122M Jan 24 00:33 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   167M Jan 24 00:35 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz
-rw-r--r--  1 pedroszekely  staff   264M Jan 24 00:32 /Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/qualifiers.tsv.gz


`claims.tsv` contains many edges:

In [20]:
!zcat < "$CLAIMS" | wc

 94123796 587802328 7562639743


# KGTK Data Model
The KGTK data model is a generalization of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. The order of the columns is arbitrary.

All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store additional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.

Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` column records the Wikidata type of the value in the `node2` column.
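
Because the columns are identified by their headings rather than by position, a custom script can read a KGTK file with any TSV reader. The sketch below uses Python's `csv` module on an in-memory sample; the helper names `missing_columns` and `read_edges` are hypothetical, and the sample row is taken from the `claims.tsv` output in this tutorial with its columns deliberately reordered.

```python
import csv
import io

REQUIRED = ("id", "node1", "label", "node2")


def missing_columns(fieldnames):
    """Return the required column names absent from the header."""
    return [name for name in REQUIRED if name not in fieldnames]


def read_edges(text):
    """Parse KGTK-style TSV text into a list of dictionaries keyed by column name."""
    # QUOTE_NONE keeps quotes that are part of a value, such as "..." or '...'@en.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = missing_columns(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


# The column order differs from the usual one, and there is an extra column.
sample = (
    "node1\tlabel\tnode2\tid\trank\n"
    "P10\tP1629\tQ34508\tP10-P1629-Q34508-bcc39400-0\tnormal\n"
)
parsed = read_edges(sample)
print(parsed[0]["id"], parsed[0]["rank"])
```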

## Claims

In [21]:
!zcat < "$CLAIMS" | head | column -t -s $'\t'

zcat: error writing to output: Broken pipe
id                              node1  label  node2                                    node2;wikidatatype  rank
P10-P1628-32b85d-7927ece6-0     P10    P1628  "http://www.w3.org/2006/vcard/ns#Video"  url                 normal
P10-P1628-acf60d-b8950832-0     P10    P1628  "https://schema.org/video"               url                 normal
P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508                                   wikibase-item       normal
P10-P1659-P1651-c4068028-0      P10    P1659  P1651                                    wikibase-property   normal
P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18                                      wikibase-property   normal
P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238                                    wikibase-property   normal
P10-P1659-P51-86aca4c5-0        P10    P1659  P51                                      wikibase-property   normal
P10-P1855-Q7378-555592a4-0      P10    P1855  Q

Wikidata uses numbers to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements. The format of the id can be arbitrary, but we assigned ids so that sorting a file by id places all edges about a subject consecutively.

In [22]:
!wd u P10 P1628 P1629

[90mid[39m P10
[42mLabel[49m video
[44mDescription[49m relevant video. For images, use the property P18. For film trailers, qualify with "object has role" (P3831)="trailer" (Q622550)
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata property to link to Commons [90m(Q18610173)[39m

[90mid[39m P1628
[42mLabel[49m equivalent property
[44mDescription[49m equivalent property in other ontologies (use in statements on properties, use property URI)
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata metaproperty for ontology mapping [90m(Q42842547)[39m

[90mid[39m P1629
[42mLabel[49m subject item of this property
[44mDescription[49m relationship represented by the property
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39mWikidata property for property documentation [90m(Q19820110)[39m


Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use the KGTK `query` command to fetch edges about Belgium. `$kypher` is a shortcut for the `kgtk query` command that also passes in the location of the SQLite database we are using to store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with label `P1081` (Human Development Index).

 **Note:** We are using the `--as` option in `kgtk query` to set an alias for the `$CLAIMS` file. This alias can be used in subsequent `kgtk query` commands.

In [33]:
result = !$kypher -i "$CLAIMS" --as "claims" \
--match '(:Q31)-[]->()' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2,node2;wikidatatype,rank
0,Q31-P1036-c4e1ad-df86eeb8-0,Q31,P1036,"""2--493""",external-id,normal
1,Q31-P1081-02c2ed-033524b0-0,Q31,P1081,+0.866,quantity,normal
2,Q31-P1081-02c2ed-7971505b-0,Q31,P1081,+0.866,quantity,normal
3,Q31-P1081-068470-c1c63b8d-0,Q31,P1081,+0.889,quantity,normal
4,Q31-P1081-068470-ddac01e0-0,Q31,P1081,+0.889,quantity,normal
5,Q31-P1081-144738-c1851cdc-0,Q31,P1081,+0.905,quantity,normal
6,Q31-P1081-175742-c07ac1c8-0,Q31,P1081,+0.888,quantity,normal
7,Q31-P1081-19636d-c08dd8a8-0,Q31,P1081,+0.896,quantity,normal
8,Q31-P1081-1efc03-433a7a4d-0,Q31,P1081,+0.913,quantity,normal
9,Q31-P1081-1f8602-ddac530d-0,Q31,P1081,+0.852,quantity,normal
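
Since the files are stored in an SQLite database, it can help to think of the match pattern as a filtered `SELECT`. The sketch below reproduces the idea with Python's `sqlite3` module on a tiny in-memory table. The table name, the schema and the SQL statement are assumptions made for illustration; they are not the statements that `kgtk query` generates internally.

```python
import sqlite3

rows = [
    ("Q31-P1036-c4e1ad-df86eeb8-0", "Q31", "P1036", '"2--493"'),
    ("Q31-P1081-02c2ed-033524b0-0", "Q31", "P1081", "+0.866"),
    ("P10-P1629-Q34508-bcc39400-0", "P10", "P1629", "Q34508"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id TEXT, node1 TEXT, label TEXT, node2 TEXT)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", rows)

# Rough analogue of: --match '(:Q31)-[]->()' --limit 10
belgium_edges = conn.execute(
    "SELECT id, node1, label, node2 FROM claims WHERE node1 = ? LIMIT 10",
    ("Q31",),
).fetchall()
conn.close()

for edge in belgium_edges:
    print(edge)
```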


The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file.

In [34]:
!zcat < "$LABEL" | head | column -t -s $'\t'

id              node1  label  node2                                     node2;wikidatatype  rank
P10-label-en    P10    label  'video'@en
P1000-label-en  P1000  label  'record held'@en
P1001-label-en  P1001  label  'applies to jurisdiction'@en
P1002-label-en  P1002  label  'engine configuration'@en
P1003-label-en  P1003  label  'National Library of Romania ID'@en
P1004-label-en  P1004  label  'MusicBrainz place ID'@en
P1005-label-en  P1005  label  'Portuguese National Library ID'@en
P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en
P1007-label-en  P1007  label  'Lattes Platform number'@en
zcat: error writing to output: Broken pipe


The `kgtk query` command accepts multiple files as input and can do a join to retrieve the label for each property. When using multiple files, it is necessary to tag each clause with the file that provides the data for the clause. For example, the first clause is tagged with `claim` because the word `claim` is part of the file name. The variable `property` is used to connect the two clauses.

**Note:** We use the alias `claims` defined in a previous cell and introduce a new alias for the `$LABEL` file.

In [35]:
result = !$kypher -i claims -i "$LABEL" --as "labels" \
--match 'claim: (n1:Q31)-[l {label: property}]->(n2), label: (property)-[:label]->(property_label)' \
--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2,label;label
0,Q31-P1036-c4e1ad-df86eeb8-0,Q31,P1036,"""2--493""",'Dewey Decimal Classification'@en
1,Q31-P1081-02c2ed-033524b0-0,Q31,P1081,+0.866,'Human Development Index'@en
2,Q31-P1081-02c2ed-7971505b-0,Q31,P1081,+0.866,'Human Development Index'@en
3,Q31-P1081-068470-c1c63b8d-0,Q31,P1081,+0.889,'Human Development Index'@en
4,Q31-P1081-068470-ddac01e0-0,Q31,P1081,+0.889,'Human Development Index'@en
5,Q31-P1081-144738-c1851cdc-0,Q31,P1081,+0.905,'Human Development Index'@en
6,Q31-P1081-175742-c07ac1c8-0,Q31,P1081,+0.888,'Human Development Index'@en
7,Q31-P1081-19636d-c08dd8a8-0,Q31,P1081,+0.896,'Human Development Index'@en
8,Q31-P1081-1efc03-433a7a4d-0,Q31,P1081,+0.913,'Human Development Index'@en
9,Q31-P1081-1f8602-ddac530d-0,Q31,P1081,+0.852,'Human Development Index'@en
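
Conceptually, the shared variable `property` turns the two clauses into a join: the `label` of each claim edge must equal the `node1` of a label edge. The plain-Python sketch below mirrors that with a dictionary lookup. The function name `join_property_labels` is hypothetical, and the sample rows are copied from the outputs shown in this tutorial.

```python
def join_property_labels(claims, label_edges):
    """Attach the label of each claim's property as a 'label;label' column."""
    label_of = {e["node1"]: e["node2"] for e in label_edges if e["label"] == "label"}
    joined = []
    for claim in claims:
        # Like an inner join, claims whose property has no label are dropped.
        if claim["label"] in label_of:
            joined.append({**claim, "label;label": label_of[claim["label"]]})
    return joined


claims = [
    {"id": "Q31-P1036-c4e1ad-df86eeb8-0", "node1": "Q31", "label": "P1036", "node2": '"2--493"'},
    {"id": "Q31-P1081-02c2ed-033524b0-0", "node1": "Q31", "label": "P1081", "node2": "+0.866"},
]
label_edges = [
    {"id": "P1036-label-en", "node1": "P1036", "label": "label",
     "node2": "'Dewey Decimal Classification'@en"},
    {"id": "P1081-label-en", "node1": "P1081", "label": "label",
     "node2": "'Human Development Index'@en"},
]
joined = join_property_labels(claims, label_edges)
for row in joined:
    print(row["label"], row["label;label"])
```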


Get all the distinct properties defined for Belgium

In [26]:
result = !$kypher -i claims -i "$LABEL" --as "labels" \
--match 'claim: (n1:Q31)-[l {label: property}]->(n2), label: (property)-[:label]->(property_label)' \
--return 'distinct property as label, property_label as `label;label`' 

kgtk_to_dataframe(result)

Unnamed: 0,label,label;label
0,P1036,'Dewey Decimal Classification'@en
1,P1081,'Human Development Index'@en
2,P1082,'population'@en
3,P1151,'topic\\'s main Wikimedia portal'@en
4,P1198,'unemployment rate'@en
...,...,...
205,P949,'National Library of Israel ID'@en
206,P982,'MusicBrainz area ID'@en
207,P984,'IOC country code'@en
208,P989,'spoken text audio'@en


Let's look at the classes that Belgium is an instance of, recorded in property `P31`:

In [27]:
result = !$kypher -i claims -i labels \
--match 'claims: (n1:Q31)-[l:P31]->(n2), labels: (n2)-[:label]->(n2_label)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2,node2;label
0,Q31-P31-Q1250464-7c4e239d-0,Q31,P31,Q1250464,'realm'@en
1,Q31-P31-Q185441-58d7de2e-0,Q31,P31,Q185441,'member state of the European Union'@en
2,Q31-P31-Q20181813-8e41ab67-0,Q31,P31,Q20181813,'colonial power'@en
3,Q31-P31-Q3624078-a1d9d1a3-0,Q31,P31,Q3624078,'sovereign state'@en
4,Q31-P31-Q43702-0dce2031-0,Q31,P31,Q43702,'federation'@en
5,Q31-P31-Q6256-3422ad69-0,Q31,P31,Q6256,'country'@en


Get all the values for population

In [28]:
result = !$kypher -i claims -i labels \
--match 'claims: (n1:Q31)-[l:P1082]->(n2)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2' 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2
0,Q31-P1082-03700d-e9540ac9-0,Q31,P1082,+10136811
1,Q31-P1082-04bed1-dfb79a97-0,Q31,P1082,+9772419
2,Q31-P1082-09cf36-da068a8a-0,Q31,P1082,+9153489
3,Q31-P1082-0d8ab5-e1fa3416-0,Q31,P1082,+9858308
4,Q31-P1082-10985f-021cd5f9-0,Q31,P1082,+9618756
...,...,...,...,...
65,Q31-P1082-ee304f-78930d38-0,Q31,P1082,+9830358
66,Q31-P1082-f304d4-5b5295bb-0,Q31,P1082,+9859242
67,Q31-P1082-f90107-aedcfbe5-0,Q31,P1082,+10445852
68,Q31-P1082-fa9783-4e530113-0,Q31,P1082,+10203008


## Qualifiers
Qualifiers provide additional information about the claims stated in the edges. For `P1082`, the qualifiers tell us the year when the population was measured. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first population value. To do so, we use the identifier of the edge (`Q31-P1082-03700d-e9540ac9-0`) as `node1` in the `qualifiers.tsv` file. We get one edge, so we know that the population in `1995` was `10136811`. Note that qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:

In [29]:
result = !$kypher -i "$QUALIFIERS" --as "qualifiers" \
--match '(n1:`Q31-P1082-03700d-e9540ac9-0`)-[l]->(n2)' 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2,node2;wikidatatype,rank
0,Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0,Q31-P1082-03700d-e9540ac9-0,P585,^1995-00-00T00:00:00Z/9,time,
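
The link between a claim and its qualifiers is simply that the claim's `id` appears as the qualifier's `node1`. The sketch below uses that link to pair each population value with its `P585` (point in time) year. The function name `population_by_year` is hypothetical, and the year is extracted by slicing the date literal, which assumes four-digit years in the `^YYYY-...` form seen in this tutorial.

```python
def population_by_year(claims, qualifiers):
    """Map the year of each P585 qualifier to the population of the claim it qualifies."""
    year_of = {}
    for q in qualifiers:
        if q["label"] == "P585":
            # node1 of a qualifier edge is the id of the claim edge it describes.
            year_of[q["node1"]] = int(q["node2"][1:5])
    return {year_of[c["id"]]: int(c["node2"]) for c in claims if c["id"] in year_of}


claims = [
    {"id": "Q31-P1082-03700d-e9540ac9-0", "node1": "Q31", "label": "P1082", "node2": "+10136811"},
    {"id": "Q31-P1082-04bed1-dfb79a97-0", "node1": "Q31", "label": "P1082", "node2": "+9772419"},
]
qualifiers = [
    {"id": "Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0", "node1": "Q31-P1082-03700d-e9540ac9-0",
     "label": "P585", "node2": "^1995-00-00T00:00:00Z/9"},
    {"id": "Q31-P1082-04bed1-dfb79a97-0-P585-271261-0", "node1": "Q31-P1082-04bed1-dfb79a97-0",
     "label": "P585", "node2": "^1974-00-00T00:00:00Z/9"},
]
by_year = population_by_year(claims, qualifiers)
print(by_year)
```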


Let's make the qualifier edge more readable by retrieving the label of the property: the following query combines the patterns of the previous queries to retrieve the label of the property. The query omits the identifier of the qualifier edges to save space. The headers of additional columns can be arbitrary, i.e., you can name them whatever you want; the names used here follow a KGTK convention that enables KGTK to parse the output automatically, which is useful if we want to use the output as input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label` because the column contains the label of the entity.

In [30]:
!$kypher -i qualifiers -i labels \
--match 'qual: (n1:`Q31-P1082-03700d-e9540ac9-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \
--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 \
| column -t -s $'\t'

node1                        label  node2                    label;label
Q31-P1082-03700d-e9540ac9-0  P585   ^1995-00-00T00:00:00Z/9  'point in time'@en


Let's put all the values of `P1082` in a file, which we will conveniently name `Q31.P1082.tsv`

In [31]:
!$kypher -i claims \
--match '(n1:Q31)-[l:P1082]->(n2)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q31.P1082.tsv

Now we are going to combine the `P1082` edges of Belgium with the qualifiers. To do this, we run a query that uses the edges that we stored in `Q31.P1082.tsv` and retrieves the qualifiers for each of those edges; the result of the query is the set of qualifier edges of the population edges. To union the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the result to the `sort2` command to sort the edges. The first and last few edges are shown below. Each claim edge is followed by the qualifiers defined for it.

This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows.

In [32]:
result = !$kypher -i qualifiers -i "$TEMP"/Q31.P1082.tsv \
--match 'P1082: ()-[l]->(), qual: (l)-[lq]->(n2)' \
--return 'lq as id, l as node1, lq.label as label, n2 as node2' \
/ cat -i - -i "$TEMP"/Q31.P1082.tsv \
/ sort2 

kgtk_to_dataframe(result)

Unnamed: 0,id,node1,label,node2
0,Q31-P1082-03700d-e9540ac9-0,Q31,P1082,+10136811
1,Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0,Q31-P1082-03700d-e9540ac9-0,P585,^1995-00-00T00:00:00Z/9
2,Q31-P1082-04bed1-dfb79a97-0,Q31,P1082,+9772419
3,Q31-P1082-04bed1-dfb79a97-0-P585-271261-0,Q31-P1082-04bed1-dfb79a97-0,P585,^1974-00-00T00:00:00Z/9
4,Q31-P1082-09cf36-da068a8a-0,Q31,P1082,+9153489
...,...,...,...,...
135,Q31-P1082-f90107-aedcfbe5-0-P585-cab8cf-0,Q31-P1082-f90107-aedcfbe5-0,P585,^2005-01-01T00:00:00Z/11
136,Q31-P1082-fa9783-4e530113-0,Q31,P1082,+10203008
137,Q31-P1082-fa9783-4e530113-0-P585-12d4de-0,Q31-P1082-fa9783-4e530113-0,P585,^1998-00-00T00:00:00Z/9
138,Q31-P1082-fb1f82-f3860fe1-0,Q31,P1082,+9646032
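
The reason each claim ends up directly before its qualifiers is that a qualifier id starts with the id of the claim it qualifies, so a plain sort by id groups them. The sketch below shows the same union-then-sort step in Python; the function name `union_and_sort` is hypothetical, and sorting on the `id` column is an assumption based on the output above.

```python
def union_and_sort(*edge_lists):
    """Concatenate several edge lists and sort the result by edge id."""
    merged = [edge for edges in edge_lists for edge in edges]
    return sorted(merged, key=lambda edge: edge["id"])


claim_edges = [
    {"id": "Q31-P1082-04bed1-dfb79a97-0", "node1": "Q31", "label": "P1082", "node2": "+9772419"},
    {"id": "Q31-P1082-03700d-e9540ac9-0", "node1": "Q31", "label": "P1082", "node2": "+10136811"},
]
qualifier_edges = [
    {"id": "Q31-P1082-04bed1-dfb79a97-0-P585-271261-0", "node1": "Q31-P1082-04bed1-dfb79a97-0",
     "label": "P585", "node2": "^1974-00-00T00:00:00Z/9"},
    {"id": "Q31-P1082-03700d-e9540ac9-0-P585-2a74fa-0", "node1": "Q31-P1082-03700d-e9540ac9-0",
     "label": "P585", "node2": "^1995-00-00T00:00:00Z/9"},
]
combined = union_and_sort(qualifier_edges, claim_edges)
for edge in combined:
    print(edge["id"])
```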


## Summary

- KGTK represents graphs in TSV files with standard columns `id`, `node1`, `label` and `node2`
- It is possible to include arbitrary additional columns in KGTK files
- The identifier of an edge can be used as a node in another edge enabling the representation of edges about edges
- KGTK provides a powerful query command based on Cypher as well as a host of other commands. Type `kgtk --help` to see the list of commands.