{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Combine Claims Edges with Qualifier Edges\n", "\n", "This notebook illustrates how to combine claims edges with qualifier edges to prepare Wikidata-like data for export to SPARQL.\n", "\n", "Parameters are set up in the first code cell so that we can run this notebook in batch mode. Example invocation command:\n", "\n", "```\n", "papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\n", "-p claims_input_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz \\\n", "-p qualifiers_input_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz \\\n", "-p output_path /data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz\n", "```\n", "\n", "The output file contains the claim and qualifier edges interleaved.\n", "Each claim edge will be followed by its matching qualifier edges.\n", "Qualifier edges that do not match a claim edge will be omitted from the output.\n", "\n", "Here are some constraints on the contents of the input files:\n", "- Each input file starts with a KGTK header record.\n", "  - The `id`, `node1`, `label`, and `node2` columns are required.\n", "  - Any additional columns in either input file will be passed on to the output file.\n", "- The `id` column in the claims file must contain a nonempty value.\n", "- The `node1` column in the qualifiers file must contain a nonempty value that matches an `id` column value in the claims file.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for invoking the notebook\n", "\n", "| Parameter | Description | Default |\n", "| --------- | ----------- | ------- |\n", "| `claims_input_path` | A file containing the Wikidata-like claim edges to combine. 
| '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz' |\n", "| `qualifiers_input_path` | A file containing the Wikidata-like qualifier edges to combine. | '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz' |\n", "| `output_path` | A file to receive the combined output. | '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz' |\n", "| `temp_folder_path` | A folder that may be used for temporary files. | '/data3/rogers/tmp' |\n", "| `gzip_command` | The compression command for sorting. | 'pigz' |\n", "| `kgtk_command` | The kgtk command and any common options. | 'time kgtk --debug --timing' |\n", "| `kgtk_extension` | The file extension for generated KGTK files. Appending `.gz` implies gzip compression. | 'tsv.gz' |\n", "| `presorted_claims` | When True, the claims input file is already sorted on the `id` column. | 'False' |\n", "| `presorted_qualifiers` | When True, the qualifiers input file is already sorted on the `node1` column. | 'False' |\n", "| `sort_extras` | Extra parameters for the sort program. The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path |\n", "| `use_mgzip` | When True, use the mgzip program where appropriate for faster compression. | 'True' |\n", "| `verbose` | When True, produce additional feedback messages. | 'True' |\n", "| `cleanup` | When True, remove temporary files at the end of processing. 
| 'True' |\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "claims_input_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims.tsv.gz'\n", "qualifiers_input_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/qualifiers.tsv.gz'\n", "output_path = '/data3/rogers/kgtk/gd/kgtk/cache/datasets/wikidata-20200803/data/claims-with-qualifiers.tsv.gz'\n", "temp_folder_path = '/data3/rogers/tmp'\n", "gzip_command = 'pigz'\n", "kgtk_command = 'time kgtk --debug --timing'\n", "kgtk_extension = 'tsv.gz'\n", "presorted_claims = 'False'\n", "presorted_qualifiers = 'False'\n", "sort_extras = '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path\n", "use_mgzip = 'True'\n", "verbose = 'True'\n", "cleanup = 'True'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort the Claims Input Data Unless Presorted\n", "Sort the claims input data file by `id`.\n", "This may take a while." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if presorted_claims.lower() == \"true\":\n", "    print('Using a presorted claims input file.')\n", "    sorted_claims_path = claims_input_path\n", "else:\n", "    print('Sorting the claims input file.')\n", "    # The default temp_folder_path has no trailing slash, so add the separator here.\n", "    sorted_claims_path = temp_folder_path + '/claims.sorted-by-id.' + kgtk_extension\n", "    !{kgtk_command} sort2 --verbose={verbose} --gzip-command={gzip_command} --use-mgzip={use_mgzip} \\\n", "        --input-file {claims_input_path} \\\n", "        --output-file {sorted_claims_path} \\\n", "        --columns id \\\n", "        --extra \"{sort_extras}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort the Qualifiers Input Data Unless Presorted\n", "Sort the qualifiers input data file by `node1`.\n", "This may take a while." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if presorted_qualifiers.lower() == \"true\":\n", "    print('Using a presorted qualifiers input file.')\n", "    sorted_qualifiers_path = qualifiers_input_path\n", "else:\n", "    print('Sorting the qualifiers input file.')\n", "    # The default temp_folder_path has no trailing slash, so add the separator here.\n", "    sorted_qualifiers_path = temp_folder_path + '/qualifiers.sorted-by-node1.' + kgtk_extension\n", "    !{kgtk_command} sort2 --verbose={verbose} --gzip-command={gzip_command} --use-mgzip={use_mgzip} \\\n", "        --input-file {qualifiers_input_path} \\\n", "        --output-file {sorted_qualifiers_path} \\\n", "        --columns node1 \\\n", "        --extra \"{sort_extras}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine the Claims and Qualifiers Edges\n", "Combine the claim and qualifier edges. Each qualifier edge will immediately follow the claim edge it matches.\n", "Qualifier edges that do not match a claim edge will be omitted.\n", "This may take a while." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "!{kgtk_command} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted --left-join --join-output \\\n", "    --input-file {sorted_claims_path} \\\n", "    --input-key id \\\n", "    --filter-file {sorted_qualifiers_path} \\\n", "    --filter-key node1 \\\n", "    --output-file {output_path}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Up If Needed\n", "Remove any temporary files created by the sort steps when `cleanup` is True.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Only remove files that this notebook created, and only when cleanup is requested.\n", "if cleanup.lower() == \"true\":\n", "    if presorted_claims.lower() != \"true\":\n", "        print('Removing the sorted claims temporary file: %s' % sorted_claims_path)\n", "        os.remove(sorted_claims_path)\n", "    if presorted_qualifiers.lower() != \"true\":\n", "        print('Removing the sorted qualifiers temporary file: %s' % sorted_qualifiers_path)\n", "        os.remove(sorted_qualifiers_path)" ] } ], "metadata": { "kernelspec": { "display_name": 
"Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }