# Creating a subset of Wikidata

This notebook illustrates counting the properties in a partitioned Wikidata KGTK edges file.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \
-p wikidata_parts_path /data4/rogers/elicit/cache/datasets/wikidata-20200803/parts
```

Here are some constraints on the contents of the input files:
- The input file starts with a KGTK header record.
  - In addition to the `id`, `node1`, `label`, and `node2` columns, the file is expected to contain `rank`, `node2;wikidatatype`, and `lang` columns.
  - The `rank` column is not used in this script.
  - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.
  - The `lang` column is used to extract English language sitelinks.
- The `id` column must contain a nonempty value.
  - It must follow certain patterns for claim and qualifier records.
  - Claim records contain 5 sections separated by hyphens (4 hyphens total).
  - Qualifier records contain 8 sections separated by hyphens (7 hyphens total).
- The first section of an `id` value must be the `node1` value for the record.
  - The qualifier extraction operations depend upon this constraint.
- In addition to the claims and qualifiers, the input file is expected to contain:
  - English language labels for all property entities appearing in the file.
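
The `id` constraints above can be checked with a few lines of Python. This is only a minimal sketch, not part of the notebook: the helper names `classify_id` and `first_section_matches` and the sample `id` values are hypothetical, and it assumes that no section of an `id` contains a hyphen itself.

```python
def classify_id(edge_id: str) -> str:
    """Classify an id by its number of hyphen-separated sections.

    Assumption: the sections themselves never contain a hyphen.
    """
    if not edge_id:
        raise ValueError("the id column must contain a nonempty value")
    sections = edge_id.split("-")
    if len(sections) == 5:
        return "claim"      # 5 sections, 4 hyphens
    if len(sections) == 8:
        return "qualifier"  # 8 sections, 7 hyphens
    return "other"


def first_section_matches(edge_id: str, node1: str) -> bool:
    """Return True when the first section of the id equals the node1 value."""
    return edge_id.split("-", 1)[0] == node1


# Hypothetical sample values, used only for illustration.
print(classify_id("Q42-P31-Q5-abc123-0"))
print(classify_id("Q42-P31-Q5-abc123-0-P580-def456-0"))
print(first_section_matches("Q42-P31-Q5-abc123-0", "Q42"))
```

Running a check like this over a sample of the input before starting the batch job may help catch malformed records early.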


### Parameters for invoking the notebook

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `wikidata_parts_path` | A folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |
| `temp_folder_path` | A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |
| `gzip_command` | The compression command for sorting. | 'pigz' |
| `sort_extras` | Extra parameters for the sort program. The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--temporary-directory ' + wikidata_parts_path |
| `unsorted_extension` | The file extension for unsorted files. | 'unsorted.tsv.gz' |
| `sorted_extension` | The file extension for sorted files. | 'tsv.gz' |
| `use_mgzip` | When True, use the mgzip program where appropriate for faster compression. | 'True' |
| `verbose` | When True, produce additional feedback messages. | 'True' |
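
As an illustration of the `sort_extras` parameter, the sketch below assembles a value that combines the temporary directory with the `--buffer-size` and `--parallel` options mentioned in the table. The helper `build_sort_extras` and the sample values are hypothetical, and the sketch assumes the sort program is GNU `sort`.

```python
import shlex


def build_sort_extras(temp_dir, buffer_size=None, parallel=None):
    """Build an extra-options string for the sort program.

    Assumption: the sort program is GNU sort, which accepts
    --temporary-directory, --buffer-size, and --parallel.
    """
    parts = ["--temporary-directory", temp_dir]
    if buffer_size is not None:
        parts += ["--buffer-size", str(buffer_size)]
    if parallel is not None:
        parts += ["--parallel", str(parallel)]
    # Quote each piece so that paths with spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


# Hypothetical values, used only for illustration.
sort_extras_example = build_sort_extras("/tmp/parts", buffer_size="4G", parallel=8)
print(sort_extras_example)
```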


In [1]:
# Parameters
wikidata_parts_path = '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts2'
temp_folder_path = wikidata_parts_path + '/temp'
gzip_command = 'pigz'
sort_extras = '--temporary-directory ' + wikidata_parts_path
unsorted_extension = 'unsorted.tsv.gz'
sorted_extension = 'tsv.gz'
use_mgzip = 'True'
verbose = 'True'


### Import the Python modules we will use in this script.
Almost all of this script consists of shell commands, so all we need to import is `os`, which we use for setup.

In [2]:
import os

### Set up environment variables and folders that we need
Define environment variables to pass the script parameters to the KGTK commands.

In [3]:
# folder with partitioned Wikidata files.
os.environ['WIKIDATA_PARTS'] = wikidata_parts_path
# temporary folder
os.environ['TEMP'] = temp_folder_path
# kgtk command to run
# os.environ['kgtk'] = "kgtk"
os.environ['kgtk'] = "time kgtk --debug --timing"
# gzip command to run
os.environ['gzip'] = gzip_command
# extra parameters for sort
os.environ['SORT_EXTRAS'] = sort_extras
# The unsorted file extension.
os.environ['UNSORTED_EXTENSION'] = unsorted_extension
# The sorted file extension.
os.environ['SORTED_EXTENSION'] = sorted_extension
# The use_mgzip flag.
os.environ['USE_MGZIP'] = use_mgzip
# The verbose flag.
os.environ['VERBOSE'] = verbose
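
The cell above only sets environment variables; it does not create any folders. The following sketch shows one way the temporary folder could be created, and how a `$NAME` reference in the shell cells resolves once the variables are set. Both helpers (`prepare_temp_folder`, `preview`) are hypothetical additions, and whether the temporary folder must exist beforehand is an assumption.

```python
import os
from string import Template


def prepare_temp_folder(path):
    """Create the temporary folder if it is missing and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


def preview(command_fragment, env):
    """Approximate how the shell substitutes $NAME references from env.

    Unknown names are left untouched rather than raising an error.
    """
    return Template(command_fragment).safe_substitute(env)


# Hypothetical values, used only for illustration.
example_env = {"WIKIDATA_PARTS": "/tmp/parts", "SORTED_EXTENSION": "tsv.gz"}
print(preview("$WIKIDATA_PARTS/claims.$SORTED_EXTENSION", example_env))
```

In the notebook itself, `preview` could be called with `os.environ` as the second argument to inspect a path before running a long KGTK command.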


### Extract the Claims Entity list
Create `claims.node1.entity.counts`. This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node1` column of the claim file, along with the matching English language labels. Wikidata items have `entityId` values that start with `Q`, while Wikidata properties have `entityId` values that start with `P`.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 --column node1 \
 --label node1-entity-count \
/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node1.entity.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted \
 --label-file-is-presorted
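
To make the two pipeline stages more concrete, here is a small pure-Python sketch of the same idea: count the distinct values in one column, then attach a label to each counted value. It is not how KGTK implements `unique` or `lift`; the function names are hypothetical, and the output layout (value in `node1`, count in `node2`, label in a `node1;label` column) is an assumption about the resulting edge file.

```python
from collections import Counter


def count_column(rows, column, count_label):
    """Count distinct values of one column and emit one edge per value."""
    counts = Counter(row[column] for row in rows)
    return [
        {"node1": value, "label": count_label, "node2": str(n)}
        for value, n in sorted(counts.items())
    ]


def lift_labels(edges, labels, column="node1"):
    """Add a '<column>;label' field to each edge; empty when no label is known."""
    lifted = []
    for edge in edges:
        out = dict(edge)
        out[column + ";label"] = labels.get(edge[column], "")
        lifted.append(out)
    return lifted


# Hypothetical sample claims and labels, used only for illustration.
claims = [
    {"node1": "Q42", "label": "P31", "node2": "Q5"},
    {"node1": "Q42", "label": "P106", "node2": "Q36180"},
    {"node1": "P31", "label": "P1629", "node2": "Q21503252"},
]
labels = {"Q42": "'Douglas Adams'@en", "P31": "'instance of'@en"}
for edge in lift_labels(count_column(claims, "node1", "node1-entity-count"), labels):
    print(edge)
```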

Create `claims.label.entity.counts`. This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `label` column of the claim file, along with English language labels.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 --column label \
 --label label-entity-count \
/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.label.entity.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted \
 --label-file-is-presorted

Create `claims.node2.entity.counts`. This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node2` column of the claim file, along with English language labels.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 -p ';; ^[PQ].*$' -o - \
/ unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --column node2 \
 --label node2-entity-count \
/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node2.entity.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted \
 --label-file-is-presorted
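
The filter pattern used above has three fields separated by semicolons, one each for `node1`, `label`, and `node2`; an empty field places no restriction on its column. The sketch below mimics that behaviour in plain Python. It is an approximation, not KGTK's implementation: the helper names are hypothetical, and it assumes the regular expressions contain no semicolons and are matched from the start of the value.

```python
import re


def parse_pattern(pattern):
    """Split 'node1 ; label ; node2' into three compiled regexes (None = any)."""
    fields = [field.strip() for field in pattern.split(";")]
    if len(fields) != 3:
        raise ValueError("expected exactly three ';'-separated fields")
    return [re.compile(field) if field else None for field in fields]


def row_matches(row, compiled):
    """Return True when every non-empty field matches its column value."""
    values = (row["node1"], row["label"], row["node2"])
    return all(rx is None or rx.match(value) for value, rx in zip(values, compiled))


# Hypothetical sample rows, used only for illustration.
entity_in_node2 = parse_pattern(";; ^[PQ].*$")
print(row_matches({"node1": "Q42", "label": "P31", "node2": "Q5"}, entity_in_node2))
print(row_matches({"node1": "Q42", "label": "P569", "node2": "^1952-03-11T00:00:00Z/11"}, entity_in_node2))
```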

### Count the number of claims per Wikidata datatype

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.datatypes.$SORTED_EXTENSION \
 --column 'node2;wikidatatype'

### Extract the Property claims
Extract the claims with Wikidata properties in the node1 column.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 -p '^P.*$ ;;' -o $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION

### Count per Wikidata property datatype the number of claims with Wikidata properties in the node1 column.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node1.property.datatypes.$SORTED_EXTENSION \
 --column 'node2;wikidatatype'

### Count the properties for claims with Wikidata properties in the node1 column and lift the English label for each property.

In [9]:
!$kgtk unique --verbose=$VERBOSE \
 --use-mgzip $USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \
 --column label \
 --label node1-property-count \
/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node1.property.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted --label-file-is-presorted

Extract the claims with Wikidata properties in the label column.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 -p '; ^P.*$ ;' -o $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION

### Count per Wikidata property datatype the number of claims with Wikidata properties in the label column.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.label.property.datatypes.$SORTED_EXTENSION \
 --column 'node2;wikidatatype'

### Count the properties for claims with Wikidata properties in the label column and lift the English label for each property.

In [9]:
!$kgtk unique --verbose=$VERBOSE \
 --use-mgzip $USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \
 --column label \
 --label label-property-count \
/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.label.property.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted --label-file-is-presorted

Extract the claims with Wikidata properties in the node2 column.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \
 -p ';; ^P.*$' -o $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION

### Count per Wikidata property datatype the number of claims with Wikidata properties in the node2 column.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node2.property.datatypes.$SORTED_EXTENSION \
 --column 'node2;wikidatatype'

### Count the properties for claims with Wikidata properties in the node2 column and lift the English label for each property.

In [9]:
!$kgtk unique --verbose=$VERBOSE \
 --use-mgzip $USE_MGZIP \
 --input-file $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \
 --column label \
 --label node2-property-count \
/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/claims.node2.property.counts.$SORTED_EXTENSION \
 --columns-to-lift node1 \
 --input-file-is-presorted --label-file-is-presorted
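
Once the counts files exist, they can be inspected from Python. The sketch below reads one gzip-compressed counts file and returns the most frequent entries. The helper `top_counts` is hypothetical, and the column names it reads (`node1` for the counted value, `node2` for the count, `node1;label` for the lifted label) are an assumption about the layout of the output files.

```python
import csv
import gzip


def top_counts(path, n=10):
    """Return the n rows with the highest counts as (value, count, label) tuples.

    Assumption: the file has 'node1' and 'node2' columns, plus an
    optional 'node1;label' column holding the lifted English label.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # KGTK values may contain double quotes, so disable CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = [
            (row["node1"], int(row["node2"]), row.get("node1;label") or "")
            for row in reader
        ]
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]
```

For example, calling `top_counts` on the `claims.label.property.counts` file under `wikidata_parts_path` would list the most frequently used properties, provided the file has the assumed columns.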