# Combine Claims Edges with Qualifier Edges

This notebook illustrates how to combine claims edges with qualifier edges to prepare Wikidata-like data for export to SPARQL.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \
-p claims_input_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz \
-p qualifiers_input_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz \
-p output_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz
```
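
When the notebook is launched from a script, the same command line can be assembled from a parameter mapping. The following is a minimal sketch: `build_papermill_command` is a hypothetical helper, and the short relative paths are placeholders, not the paths used above. It relies only on papermill's `-p NAME VALUE` option shown in the invocation.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the papermill argument list for the given parameter mapping."""
    command = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # Each notebook parameter becomes a `-p NAME VALUE` triple.
        command += ["-p", name, str(value)]
    return command


# Placeholder paths, for illustration only.
example_command = build_papermill_command(
    "partition-wikidata.ipynb",
    "partition-wikidata.out.ipynb",
    {
        "claims_input_path": "data/claims.tsv.gz",
        "qualifiers_input_path": "data/qualifiers.tsv.gz",
        "output_path": "data/claims-with-qualifiers.tsv.gz",
    },
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(example_command))
```

The resulting list can also be passed to `subprocess.run` directly, which avoids shell quoting issues.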

The output file contains the claims and qualifiers edges interleaved.
Each claim edge will be followed by its matching qualifier edges.
Qualifier edges that do not match a claim edge will be omitted from the output.
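
The interleaving described above can be illustrated with a small in-memory sketch. This is not how KGTK implements the join (the notebook streams presorted files through `kgtk ifexists`); the `interleave` function and the sample edges are hypothetical and only show the expected shape of the output.

```python
from collections import defaultdict


def interleave(claims, qualifiers):
    """Place each claim's matching qualifiers directly after the claim.

    Each edge is an (id, node1, label, node2) tuple. A qualifier matches a
    claim when the qualifier's node1 value equals the claim's id value.
    """
    qualifiers_by_claim_id = defaultdict(list)
    for qualifier in qualifiers:
        qualifiers_by_claim_id[qualifier[1]].append(qualifier)

    combined = []
    for claim in claims:
        combined.append(claim)
        # Qualifiers whose node1 matches no claim id are never emitted.
        combined.extend(qualifiers_by_claim_id.get(claim[0], []))
    return combined


# Made-up edges, for illustration only.
sample_claims = [
    ("Q1-P31-Q5", "Q1", "P31", "Q5"),
    ("Q2-P31-Q5", "Q2", "P31", "Q5"),
]
sample_qualifiers = [
    ("Q1-P31-Q5-P580", "Q1-P31-Q5", "P580", "^2020-01-01T00:00:00Z/11"),
    # This qualifier refers to a claim id that is not in sample_claims.
    ("Q9-P31-Q5-P580", "Q9-P31-Q5", "P580", "^2019-01-01T00:00:00Z/11"),
]

combined_edges = interleave(sample_claims, sample_qualifiers)
for edge in combined_edges:
    print("\t".join(edge))
```

In this sample, the first claim is followed by its one qualifier, the second claim has none, and the unmatched qualifier is dropped.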

Here are some constraints on the contents of the input files:
- Each input file starts with a KGTK header record.
  - The `id`, `node1`, `label`, and `node2` columns are required.
  - Any additional columns in either input file will be passed on to the output file.
- The `id` column in the claims file must contain a nonempty value.
- The `node1` column in the qualifiers file must contain a nonempty value that matches an `id` column value in the claims file.
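
The header constraint can be checked before starting a long run. The sketch below is an assumption-laden illustration, not part of KGTK: `read_header` and `missing_required_columns` are hypothetical helpers, and the check compares literal column names only.

```python
import gzip

REQUIRED_COLUMNS = ("id", "node1", "label", "node2")


def read_header(path):
    """Return the first line of a plain or gzip-compressed TSV file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return handle.readline()


def missing_required_columns(header_line):
    """Return the required column names that are absent from the header."""
    columns = header_line.rstrip("\r\n").split("\t")
    return [name for name in REQUIRED_COLUMNS if name not in columns]


# Header strings made up for illustration; extra columns are allowed.
complete_header = "id\tnode1\tlabel\tnode2\trank\n"
incomplete_header = "node1\tlabel\tnode2\n"

print(missing_required_columns(complete_header))
print(missing_required_columns(incomplete_header))
```

For a real file, `missing_required_columns(read_header(claims_input_path))` should return an empty list.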


### Parameters for invoking the notebook

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `claims_input_path` | A file containing the Wikidata-like claim edges to combine. | '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz' |
| `qualifiers_input_path` | A file containing the Wikidata-like qualifier edges to combine. | '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz' |
| `output_path` |         A file to receive the combined output. | '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz' |
| `temp_folder_path` |    A folder that may be used for temporary files. | '/data3/rogers/tmp' |
| `gzip_command` |        The compression command for sorting. | 'pigz' |
| `kgtk_command` |        The kgtk command and any common options. | 'time kgtk --debug --timing' |
| `kgtk_extension` |      The file extension for generated KGTK files. Appending `.gz` implies gzip compression. | 'tsv.gz' |
| `presorted_claims` |    When True, the claims input file is already sorted on the `id` column. | 'False' |
| `presorted_qualifiers` | When True, the qualifiers input file is already sorted on the `node1` column. | 'False' |
| `sort_extras` |         Extra parameters for the sort program.  The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path |
| `use_mgzip` |           When True, use the mgzip program where appropriate for faster compression. | 'True' |
| `verbose` |             When True, produce additional feedback messages. | 'True' |
| `cleanup` |             When True, remove temporary files at the end of processing. | 'True' |


In [13]:
# Parameters
claims_input_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz'
qualifiers_input_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz'
output_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz'
temp_folder_path =     '/data3/rogers/tmp'
gzip_command =         'pigz'
kgtk_command =         'time kgtk --debug --timing'
kgtk_extension =       'tsv.gz'
presorted_claims =     'False'
presorted_qualifiers = 'False'
sort_extras =          '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path
use_mgzip =            'True'
verbose =              'True'
cleanup =              'True'


### Sort the Claims Input Data Unless Presorted
Sort the claims input data file by id.
This may take a while.

In [None]:
import os
if presorted_claims.lower() == "true":
    print('Using a presorted claims input file.')
    sorted_claims_path = claims_input_path
else:
    print('Sorting the claims input file.')
    # Join with os.path.join so the file lands inside the temporary folder.
    sorted_claims_path = os.path.join(temp_folder_path, 'claims.sorted-by-id.' + kgtk_extension)
    !{kgtk_command} sort2 --verbose={verbose} --gzip-command={gzip_command} --use-mgzip={use_mgzip} \
 --input-file {claims_input_path} \
 --output-file {sorted_claims_path} \
 --columns     id \
 --extra       "{sort_extras}"
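
As a rough illustration of what sorting a KGTK file on one column means, the sketch below sorts the data rows in memory and keeps the header record first. `sort_rows_by_column` and the sample rows are hypothetical. The ordering here is Python's code point ordering, which may differ from the ordering the external sort program produces; what matters for the later join is that both inputs are sorted the same way.

```python
def sort_rows_by_column(lines, column_name):
    """Sort tab-separated data rows on one column, keeping the header first."""
    header, *rows = [line.rstrip("\n") for line in lines]
    key_index = header.split("\t").index(column_name)
    rows.sort(key=lambda row: row.split("\t")[key_index])
    return [header] + rows


# Made-up rows, for illustration only.
sample_lines = [
    "id\tnode1\tlabel\tnode2",
    "Q2-P31-Q5\tQ2\tP31\tQ5",
    "Q1-P31-Q5\tQ1\tP31\tQ5",
]

sorted_lines = sort_rows_by_column(sample_lines, "id")
for line in sorted_lines:
    print(line)
```

The real input is far too large for an in-memory sort, which is why the notebook passes temporary-directory and buffer options through `sort_extras`.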

### Sort the Qualifiers Input Data Unless Presorted
Sort the qualifiers input data file by node1.
This may take a while.

In [None]:
import os
if presorted_qualifiers.lower() == "true":
    print('Using a presorted qualifiers input file.')
    sorted_qualifiers_path = qualifiers_input_path
else:
    print('Sorting the qualifiers input file.')
    # Join with os.path.join so the file lands inside the temporary folder.
    sorted_qualifiers_path = os.path.join(temp_folder_path, 'qualifiers.sorted-by-node1.' + kgtk_extension)
    !{kgtk_command} sort2 --verbose={verbose} --gzip-command={gzip_command} --use-mgzip={use_mgzip} \
 --input-file {qualifiers_input_path} \
 --output-file {sorted_qualifiers_path} \
 --columns     node1 \
 --extra       "{sort_extras}"

### Combine the Claims and Qualifiers Edges
Combine the claims and qualifiers edges.  Qualifier edges will immediately follow the claim edge they match.
Qualifier edges that do not match a claim edge will be omitted.
This may take a while.

In [8]:
!{kgtk_command} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted --left-join --join-output \
 --input-file {sorted_claims_path} \
 --input-key id \
 --filter-file {sorted_qualifiers_path} \
 --filter-key node1 \
 --output-file {output_path}

### Clean Up If Needed
Remove any temporary files.


In [None]:
import os
# Only remove files this notebook created, and only when cleanup is requested.
if cleanup.lower() == "true":
    if presorted_claims.lower() != "true":
        print('Removing the sorted claims temporary file: %s' % sorted_claims_path)
        os.remove(sorted_claims_path)
    if presorted_qualifiers.lower() != "true":
        print('Removing the sorted qualifiers temporary file: %s' % sorted_qualifiers_path)
        os.remove(sorted_qualifiers_path)