# Find Ambiguous Items

This notebook illustrates how to create a subset of Wikidata that contains ambiguous items. Given a list of qnodes, the notebook returns the additional qnodes that share a label string or an alias string with one of the input qnodes.

Parameters are in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill 'Example 11 - Find Ambiguous Items.ipynb' \
  -p wikidata_parts_path /lfs1/ktyao/Shared/KGTK/datasets/wikidata-20200803-v2/useful_wikidata_files \
  -p instance_csv_file /lfs1/ktyao/data/iswc-2019/t2dv2_gt.csv \
  -p instance_qnode_col kg_id \
  -p subset_name iscw \
  -p output_path /home/ktyao/dev/kgtk-cache \
  -p cache_path /home/ktyao/dev/kgtk-cache \
  -p delete_database no
```

### Parameters for invoking the notebook

- `wikidata_parts_path`: a folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz`
- `instance_csv_file`: the path to the csv containing the input qnodes
- `instance_qnode_col`: the name of the csv table column containing the input qnodes
- `subset_name`: the name of the subset being created.
- `output_path`: the path where a folder will be created to hold the KGTK files for the subset. A folder named `subset_name` will be created in this folder.
- `cache_path`: the path of a folder where the Kypher SQL database will be created.
- `delete_database`: whether to delete the SQL database before running the notebook: "" or "no" means don't delete it.
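
The parameter list above maps directly onto repeated `-p name value` options. As a minimal sketch, the helper below (`build_papermill_command`, a hypothetical name that is not part of the notebook) assembles such a command line from a dictionary; only two of the parameters are shown, and the values are the ones from the example invocation.

```python
import shlex


def build_papermill_command(notebook, parameters):
    """Return the argument list for a papermill run with one -p pair per parameter."""
    args = ["papermill", notebook]
    for key, value in parameters.items():
        args += ["-p", key, str(value)]
    return args


# Illustrative subset of the notebook parameters.
params = {
    "subset_name": "iscw",
    "delete_database": "no",
}
cmd = build_papermill_command("Example 11 - Find Ambiguous Items.ipynb", params)

# shlex.join quotes the notebook name because it contains spaces.
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd)` without `shell=True`, which avoids quoting problems with the spaces in the notebook name.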

### Sample output

Pairs of qnodes that have common label and/or alias strings.
```
$ zcat shares_name.iscw.tsv.gz | head
node1 node1;label node2 node2;label label
Q100 'Boston'@en Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name
Q100 'Boston'@en Q26664318 'The Hub'@en-gb Pshares_name
Q100 'Boston'@en Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name
Q100 'Boston'@en-ca Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name
Q100 'Boston'@en-ca Q26664318 'The Hub'@en-gb Pshares_name
Q100 'Boston'@en-ca Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name
Q100 'Boston'@en-gb Q26664318 'Royal Court Theatre, Wigan'@en Pshares_name
Q100 'Boston'@en-gb Q26664318 'The Hub'@en-gb Pshares_name
Q100 'Boston'@en-gb Q27714978 'Bostonia (Boston, Mass. : 1986)'@en Pshares_name
```

All qnodes including input qnodes and ambiguous qnodes.
```
$ zcat qnodes_combined.iscw.tsv.gz | head
node1 label node2
Qiscw Pcontains_entity Q100
Qiscw Pcontains_entity Q1000
Qiscw Pcontains_entity Q1001437
Qiscw Pcontains_entity Q1001910
Qiscw Pcontains_entity Q1002142
Qiscw Pcontains_entity Q1002860
Qiscw Pcontains_entity Q1003177
Qiscw Pcontains_entity Q1004657
Qiscw Pcontains_entity Q1005
```

In [1]:
# Parameters
wikidata_parts_path = "/lfs1/ktyao/Shared/KGTK/datasets/wikidata-20200803-v2/useful_wikidata_files"
instance_csv_file="/lfs1/ktyao/data/iswc-2019/t2dv2_gt.csv"
instance_qnode_col="kg_id"
subset_name="iscw"
output_path = "/home/ktyao/dev/kgtk-cache"
cache_path = "/home/ktyao/dev/kgtk-cache"
delete_database = "no"

In [2]:
temp_folder = subset_name + "-temp"
output_folder = subset_name

In [3]:
import csv
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

A convenience function to run templatized commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary.

In [4]:
def run_command(command, substitution_dictionary=None):
    """Run a templatized command through the shell and print its output."""
    cmd = command.replace("NAME", subset_name)
    # Avoid a mutable default argument; treat None as "no extra substitutions".
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # With shell=True the command is passed as a single string.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
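
To see what the substitution step does without running anything, the sketch below isolates it in a helper called `render_command` (a hypothetical name introduced here for illustration). It shows that `NAME` is replaced wherever it occurs, including inside a longer token such as `QNAME`, while shell variables such as `$TEMP` are left for the shell to expand.

```python
def render_command(template, name, substitutions=None):
    """Apply the same plain-text substitutions as run_command, without executing."""
    cmd = template.replace("NAME", name)
    for key, value in (substitutions or {}).items():
        cmd = cmd.replace(key, value)
    return cmd


template = "$kgtk query -i $TEMP/instance_qnodes.tsv -o $TEMP/qnodelist.NAME.tsv.gz"
rendered = render_command(template, "iscw")
print(rendered)

# Extra keys are replaced the same way; FILE is an illustrative placeholder.
print(render_command("-i FILE", "iscw", {"FILE": "a.tsv"}))
```

Because the replacement is purely textual, a template that contains `$NAME` would have the variable name itself rewritten, so templates passed to `run_command` should use the bare `NAME` form.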

### Set up environment variables and folders that we need
We need to define environment variables to pass to the KGTK commands.

In [5]:
# folder containing wikidata broken down into smaller files.
os.environ['WIKIDATA_PARTS'] = wikidata_parts_path
# name of the dataset
os.environ['NAME'] = subset_name
# folder where to put the output
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
# temporary folder
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
# kgtk command to run
os.environ['kgtk'] = "kgtk"
# os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
# alias zcat
has_zcat = !command -v zcat
if has_zcat:
    # for linux
    %alias gzcat zcat
else:
    # for mac
    %alias gzcat gzcat
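
The following sketch restates how the environment values are derived from the notebook parameters, and how a path such as `$OUT/shares_name.$NAME.tsv.gz` resolves once they are set. The function `build_environment` and the `/data/...` paths are assumptions made for illustration; `string.Template` stands in for the shell's variable expansion.

```python
from string import Template


def build_environment(wikidata_parts_path, subset_name, output_path, cache_path=""):
    """Return the variable values that the setup cell stores in os.environ."""
    temp_folder = subset_name + "-temp"
    env = {
        "WIKIDATA_PARTS": wikidata_parts_path,
        "NAME": subset_name,
        "OUT": f"{output_path}/{subset_name}",
        "TEMP": f"{output_path}/{temp_folder}",
        "kgtk": "kgtk",
    }
    # The database lives in cache_path when given, otherwise in the temp folder.
    store_dir = cache_path if cache_path else env["TEMP"]
    env["STORE"] = f"{store_dir}/wikidata.sqlite3.db"
    return env


# Hypothetical paths, not the ones used in the notebook run.
env = build_environment("/data/parts", "iscw", "/data/out", cache_path="/data/cache")
expanded = Template("$OUT/shares_name.$NAME.tsv.gz").substitute(env)
print(expanded)
```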

In [6]:
cd $output_path

/home/ktyao/dev/kgtk-cache


In [7]:
!mkdir $output_folder
!mkdir $temp_folder

mkdir: cannot create directory ‘iscw’: File exists
mkdir: cannot create directory ‘iscw-temp’: File exists


In [8]:
!rm $OUT/*.tsv $OUT/*.tsv.gz
!rm $TEMP/*.tsv $TEMP/*.tsv.gz

rm: cannot remove '/home/ktyao/dev/kgtk-cache/iscw/*.tsv': No such file or directory
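
The `mkdir` and `rm` messages above are harmless on a rerun, but they can be avoided. The sketch below is one possible alternative, not what the notebook does: a hypothetical helper `prepare_folder` creates the folder if needed and removes only the `*.tsv` and `*.tsv.gz` files in it. The demo runs inside a throwaway temporary directory.

```python
import glob
import os
import tempfile


def prepare_folder(path):
    """Create path if it is missing and delete leftover .tsv / .tsv.gz files."""
    os.makedirs(path, exist_ok=True)  # no error when the folder already exists
    removed = []
    for pattern in ("*.tsv", "*.tsv.gz"):
        for file_path in glob.glob(os.path.join(path, pattern)):
            os.remove(file_path)
            removed.append(file_path)
    return removed


with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "iscw-temp")
    prepare_folder(folder)  # first call creates the folder
    open(os.path.join(folder, "old.tsv"), "w").close()
    open(os.path.join(folder, "notes.txt"), "w").close()
    removed = prepare_folder(folder)  # second call clears old.tsv only
    remaining = sorted(os.listdir(folder))
```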


In [9]:
if delete_database and delete_database != "no":
    print("Deleted database")
    !rm $STORE



### Extract Q-nodes from input file

In [10]:
df = pd.read_csv(instance_csv_file, lineterminator='\n')
result = pd.DataFrame(df[instance_qnode_col].unique())
result.columns = ['node2']
result['node1'] = 'n1'
result['label'] = 'l'
result = result[['node1', 'label', 'node2']]
result.to_csv(f"{output_path}/{temp_folder}/instance_qnodes.tsv", sep="\t", quoting=csv.QUOTE_NONE, index=None)
!head $TEMP/instance_qnodes.tsv
!wc $TEMP/instance_qnodes.tsv

node1	label	node2
n1	l	Q470336
n1	l	Q494085
n1	l	Q105702
n1	l	Q26060
n1	l	Q2345
n1	l	Q42198
n1	l	Q320588
n1	l	Q132689
n1	l	Q45602
 11892 35676 157305 /home/ktyao/dev/kgtk-cache/iscw-temp/instance_qnodes.tsv
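
The cell above turns the unique values of one CSV column into a KGTK edge file with placeholder `node1` and `label` values. A dependency-free sketch of the same idea, using only the standard `csv` module and made-up input rows, might look like this (the helper names are hypothetical):

```python
import csv
import io


def qnodes_to_edge_rows(records, qnode_col):
    """Return one ('n1', 'l', qnode) row per distinct qnode, in first-seen order."""
    seen = set()
    rows = []
    for record in records:
        qnode = record[qnode_col]
        if qnode not in seen:
            seen.add(qnode)
            rows.append(("n1", "l", qnode))
    return rows


def write_edge_tsv(rows, handle):
    """Write the rows as a tab-separated file with a node1/label/node2 header."""
    writer = csv.writer(handle, delimiter="\t", quoting=csv.QUOTE_NONE, lineterminator="\n")
    writer.writerow(("node1", "label", "node2"))
    writer.writerows(rows)


# Illustrative input; the real notebook reads instance_csv_file.
sample = io.StringIO("kg_id,other\nQ100,a\nQ1000,b\nQ100,c\n")
rows = qnodes_to_edge_rows(csv.DictReader(sample), "kg_id")
out = io.StringIO()
write_edge_tsv(rows, out)
print(out.getvalue())
```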


In [11]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $TEMP/instance_qnodes.tsv \
 --graph-cache $STORE \
 -o $TEMP/qnodelist.NAME.tsv.gz \
 --match 'instance: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' \
 --return 'distinct n1 as node1, l.label as label, n2 as node2'"
run_command(command)
%gzcat $TEMP/qnodelist.$NAME.tsv.gz | head
!echo wc -l $TEMP/qnodelist.$NAME.tsv.gz
%gzcat $TEMP/qnodelist.$NAME.tsv.gz | wc -l

$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $TEMP/instance_qnodes.tsv --graph-cache $STORE -o $TEMP/qnodelist.iscw.tsv.gz --match 'instance: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' --return 'distinct n1 as node1, l.label as label, n2 as node2'


node1	label	node2
Q100	isa	Q1549591
Q100	isa	Q21518270
Q100	isa	Q1093829
Q1000	isa	Q6256
Q1000	isa	Q3624078
Q1000	isa	Q179023
Q1001437	isa	Q4830453
Q1001910	isa	Q5
Q1002142	isa	Q219557

gzip: stdout: Broken pipe
wc -l /home/ktyao/dev/kgtk-cache/iscw-temp/qnodelist.iscw.tsv.gz
16584
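
Read as a join, the query above keeps the `isa` edges whose `node1` appears in the instance file. The sketch below mimics that on in-memory tuples; it is an interpretation of the query rather than a description of how Kypher executes it, and `isa_edges_for` is a hypothetical helper. The first two edges are taken from the output above, and `Q999` is a made-up id.

```python
def isa_edges_for(instance_qnodes, isa_edges):
    """Return the distinct isa edges whose node1 is one of the instance qnodes."""
    wanted = set(instance_qnodes)
    seen = set()
    result = []
    for edge in isa_edges:
        node1 = edge[0]
        if node1 in wanted and edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


isa_edges = [
    ("Q100", "isa", "Q1549591"),
    ("Q1000", "isa", "Q6256"),
    ("Q999", "isa", "Q5"),        # made-up id, not in the input list
    ("Q100", "isa", "Q1549591"),  # duplicate, dropped by the distinct step
]
selected = isa_edges_for(["Q100", "Q1000"], isa_edges)
print(selected)
```

Under this reading, an input qnode without any `isa` edge does not appear in the result, and a qnode with several `isa` edges appears once per edge.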


### Extending KG to include nodes with ambiguous names

Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_label

In [12]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n2)-[:alias]->(n1_label), label: (n2)-[:label]->(n2_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, "Pshares_name" as label' \
 | gzip > $TEMP/shares_name.label_alias.$NAME.tsv.gz 

Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_alias

In [13]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), label: (n2)-[:label]->(n1_alias)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_alias as `node2;label`, "Pshares_name" as label' \
 | gzip > $TEMP/shares_name.alias_label.$NAME.tsv.gz

Find node2s where we have node1/alias/node1_alias in qnodelist such that there exists a node2/alias/node2_alias in Wikidata such that node2_alias = node1_alias

In [14]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz -i $WIKIDATA_PARTS/part.alias.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), alias: (n1)-[:alias]->(n1_alias), alias: (n2)-[:alias]->(n1_alias), label: (n2)-[:label]->(n2_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n2_label as `node2;label`, "Pshares_name" as label' \
 | gzip > $TEMP/shares_name.alias_alias.$NAME.tsv.gz

Find node2s where we have node1/label/node1_label in qnodelist such that there exists a node2/label/node2_label in Wikidata such that node2_label = node1_label

In [15]:
!$kgtk query -i $TEMP/qnodelist.$NAME.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \
--match 'qnodelist: (n1)-[]->(), label: (n1)-[:label]->(n1_label), label: (n2)-[:label]->(n1_label)' \
--where 'n1 != n2' \
--return 'distinct n1 as node1, n1_label as `node1;label`, n2 as node2, n1_label as `node2;label`, "Pshares_name" as label' \
| gzip > $TEMP/shares_name.label_label.$NAME.tsv.gz
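
Taken together, the four queries pair an input qnode with any other qnode that shares one of its names, where a name is either a label or an alias. The sketch below expresses that union over small in-memory dictionaries. It is a simplification made under stated assumptions: names are plain strings without language tags, a qnode is not required to have a label, and the ids `Q_a` through `Q_d` are made up.

```python
from collections import defaultdict


def find_name_sharers(input_qnodes, labels, aliases):
    """Return sorted (node1, node2) pairs that share a label or alias string.

    labels maps qnode -> label string; aliases maps qnode -> list of alias strings.
    """
    # Index every name (label or alias) by the qnodes that carry it.
    nodes_by_name = defaultdict(set)
    for qnode, label in labels.items():
        nodes_by_name[label].add(qnode)
    for qnode, alias_list in aliases.items():
        for alias in alias_list:
            nodes_by_name[alias].add(qnode)

    pairs = set()
    for n1 in input_qnodes:
        own_names = set(aliases.get(n1, []))
        if n1 in labels:
            own_names.add(labels[n1])
        for name in own_names:
            for n2 in nodes_by_name[name]:
                if n1 != n2:  # same role as --where 'n1 != n2'
                    pairs.add((n1, n2))
    return sorted(pairs)


labels = {"Q100": "Boston", "Q_a": "Boston", "Q_b": "The Hub"}
aliases = {"Q100": ["The Hub", "Beantown"], "Q_c": ["Boston"], "Q_d": ["Beantown"]}
pairs = find_name_sharers(["Q100"], labels, aliases)
print(pairs)
```

In this toy data each pair comes from a different case: `Q_a` is a label/label match, `Q_b` alias/label, `Q_c` label/alias, and `Q_d` alias/alias.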

Combine the above four files into one file

In [16]:
!ls $TEMP/shares_name.*.$NAME.tsv.gz | xargs echo "-i" | xargs kgtk cat | gzip > $OUT/shares_name.$NAME.tsv.gz

In [17]:
%gzcat $OUT/shares_name.$NAME.tsv.gz | wc

 143901 951117 9511726


In [18]:
!echo Unique ambiguous Q-nodes
%gzcat $OUT/shares_name.$NAME.tsv.gz | awk -F'\t' '{print $3}' | tail -n +2 | uniq | wc -l

Unique ambiguous Q-nodes
132800
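
Note that `uniq` collapses only adjacent duplicate lines. Unless the rows are already grouped by `node2`, the count above may therefore be larger than the number of distinct Q-nodes. The sketch below contrasts the two ways of counting on a small made-up column.

```python
from itertools import groupby


def count_adjacent_unique(values):
    """Count runs of equal adjacent values, as `uniq | wc -l` does."""
    return sum(1 for _ in groupby(values))


def count_distinct(values):
    """Count distinct values regardless of order, as `sort -u | wc -l` does."""
    return len(set(values))


# Illustrative node2 column in which Q2 reappears after Q3.
node2_column = ["Q2", "Q3", "Q2", "Q2", "Q4"]
adjacent = count_adjacent_unique(node2_column)
distinct = count_distinct(node2_column)
print(adjacent, distinct)
```

On the command line, replacing `uniq` with `sort -u` would give the order-independent count.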


### Generate combined KG of input Q-nodes and ambiguous Q-nodes

In [19]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $OUT/shares_name.NAME.tsv.gz \
 --graph-cache $STORE \
 -o $TEMP/qnodelist_ambiguous.NAME.tsv.gz \
 --match 'shares: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' \
 --return 'distinct n1 as node1, l.label as label, n2 as node2'"
run_command(command)

$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $OUT/shares_name.iscw.tsv.gz --graph-cache $STORE -o $TEMP/qnodelist_ambiguous.iscw.tsv.gz --match 'shares: ()-[]-(n1), isa: (n1)-[l:isa]->(n2)' --return 'distinct n1 as node1, l.label as label, n2 as node2'




In [20]:
!kgtk cat -i $TEMP/qnodelist.$NAME.tsv.gz -i $TEMP/qnodelist_ambiguous.$NAME.tsv.gz \
 | gzip > $OUT/isa_combined.$NAME.tsv.gz

In [21]:
command = """kgtk query -i $OUT/isa_combined.NAME.tsv.gz \
 --graph-cache $STORE \
 -o $OUT/qnodes_combined.NAME.tsv.gz \
 --match '(n2)-[]->()' \
 --return 'distinct "QNAME" as node1, "Pcontains_entity" as label, n2 as node2'"""
run_command(command)

kgtk query -i $OUT/isa_combined.iscw.tsv.gz --graph-cache $STORE -o $OUT/qnodes_combined.iscw.tsv.gz --match '(n2)-[]->()' --return 'distinct "Qiscw" as node1, "Pcontains_entity" as label, n2 as node2'
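
The last query emits one `Pcontains_entity` edge per distinct `node1` in the combined `isa` file, all pointing from a node named after the subset. A minimal in-memory sketch of that step follows; `contains_entity_rows` is a hypothetical helper, and the sorting is added here only to make the output deterministic.

```python
def contains_entity_rows(isa_edges, subset_name):
    """Return (Q<subset_name>, 'Pcontains_entity', qnode) for each distinct node1."""
    subset_node = "Q" + subset_name
    nodes = sorted({node1 for node1, _label, _node2 in isa_edges})
    return [(subset_node, "Pcontains_entity", node) for node in nodes]


# Edges taken from the qnodelist output shown earlier.
isa_edges = [
    ("Q100", "isa", "Q1549591"),
    ("Q100", "isa", "Q21518270"),
    ("Q1000", "isa", "Q6256"),
]
membership = contains_entity_rows(isa_edges, "iscw")
for row in membership:
    print("\t".join(row))
```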


