{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating a subset of Wikidata\n", "\n", "This notebook illustrates counting the properties in a partitioned Wikidata KGTK edges file.\n", "\n", "Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\n", "\n", "```\n", "papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\n", "-p wikidata_parts_path /data4/rogers/elicit/cache/datasets/wikidata-20200803/parts\n", "```\n", "\n", "Here are some constraints on the contents of the input files:\n", "- The input file starts with a KGTK header record.\n", " - In addition to the `id`, `node1`, `label`, and `node2` columns, the file is expected to contain `rank`, `node2;wikidatatype`, and `lang` columns.\n", " - The `rank` column is not used in this script.\n", " - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.\n", " - The `lang` column is used to extract English language sitelinks.\n", "- The `id` column must contain a nonempty value.\n", " - It must follow certain patterns for claim and qualifier records.\n", " - Claim records contain 5 sections separated by hyphens (4 hyphens total).\n", " - Qualifier records contain 8 sections separated by hyphens (7 hyphens total).\n", "- The first section of an `id` value must be the `node1` value for the record.\n", " - The qualifier extraction operations depend upon this constraint. 
\n", "- In addition to the claims and qualifiers, the input file is expected to contain:\n", " - English language labels for all property entities appearing in the file.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for invoking the notebook\n", "\n", "| Parameter | Description | Default |\n", "| --------- | ----------- | ------- |\n", "| `wikidata_parts_path` | A folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |\n", "| `temp_folder_path` | A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |\n", "| `gzip_command` | The compression command for sorting. | 'pigz' |\n", "| `sort_extras` | Extra parameters for the sort program. The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--temporary-directory ' + wikidata_parts_path |\n", "| `unsorted_extension` | The file extension for unsorted files. | 'unsorted.tsv.gz' |\n", "| `sorted_extension` | The file extension for sorted files. | 'tsv.gz' |\n", "| `use_mgzip` | When True, use the mgzip program where appropriate for faster compression. | 'True' |\n", "| `verbose` | When True, produce additional feedback messages. 
| 'True' |\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Parameters\n", "wikidata_parts_path = '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts2'\n", "temp_folder_path = wikidata_parts_path + '/temp'\n", "gzip_command = 'pigz'\n", "sort_extras = '--temporary-directory ' + wikidata_parts_path\n", "unsorted_extension = 'unsorted.tsv.gz'\n", "sorted_extension = 'tsv.gz'\n", "use_mgzip = 'True'\n", "verbose = 'True'\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the Python modules we will use in this script.\n", "Almost all of this script consists of shell commands, so all we need to import is `os`, which we use for setup." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up environment variables and folders that we need\n", "Define environment variables to pass the script parameters to the KGTK commands." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# folder with partitioned Wikidata files.\n", "os.environ['WIKIDATA_PARTS'] = wikidata_parts_path\n", "# temporary folder\n", "os.environ['TEMP'] = temp_folder_path\n", "# kgtk command to run\n", "# os.environ['kgtk'] = \"kgtk\"\n", "os.environ['kgtk'] = \"time kgtk --debug --timing\"\n", "# gzip command to run\n", "os.environ['gzip'] = gzip_command\n", "# extra parameters for sort\n", "os.environ['SORT_EXTRAS'] = sort_extras\n", "# The unsorted file extension.\n", "os.environ['UNSORTED_EXTENSION'] = unsorted_extension\n", "# The sorted file extension.\n", "os.environ['SORTED_EXTENSION'] = sorted_extension\n", "# The use_mgzip flag.\n", "os.environ['USE_MGZIP'] = use_mgzip\n", "# The verbose flag.\n", "os.environ['VERBOSE'] = verbose\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the Claims Entity list\n", "Create `claims.node1.entity.counts`. This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node1` column of the claim file, along with the matching English language labels. Wikidata items have `entityId` values that start with `Q`, while Wikidata properties have `entityId` values that start with `P`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " --column node1 \\\n", " --label node1-entity-count \\\n", "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node1.entity.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted \\\n", " --label-file-is-presorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create `claims.label.entity.counts`. 
This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `label` column of the claim file, along with English language labels." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " --column label \\\n", " --label label-entity-count \\\n", "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.label.entity.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted \\\n", " --label-file-is-presorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create `claims.node2.entity.counts`. This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node2` column of the claim file, along with English language labels." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " -p ';; ^[PQ].*$' -o - \\\n", "/ unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --column node2 \\\n", " --label node2-entity-count \\\n", "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node2.entity.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted \\\n", " --label-file-is-presorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of claims per Wikidata datatype" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.datatypes.$SORTED_EXTENSION \\\n", " --column 'node2;wikidatatype'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the Property claims\n", "Extract the claims with Wikidata properties in the node1 column." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " -p '^P.*$ ;;' -o $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count per Wikidata property datatype the number of claims with Wikidata properties in the node1 column." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node1.property.datatypes.$SORTED_EXTENSION \\\n", " --column 'node2;wikidatatype'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the properties for claims with Wikidata properties in the node1 column and lift the English label for each property." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE \\\n", " --use-mgzip $USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \\\n", " --column label \\\n", " --label node1-property-count \\\n", "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node1.property.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted --label-file-is-presorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the claims with Wikidata properties in the label column." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " -p '; ^P.*$ ;' -o $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count per Wikidata property datatype the number of claims with Wikidata properties in the label column." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.label.property.datatypes.$SORTED_EXTENSION \\\n", " --column 'node2;wikidatatype'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the properties for claims with Wikidata properties in the label column and lift the English label for each property." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE \\\n", " --use-mgzip $USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \\\n", " --column label \\\n", " --label label-property-count \\\n", "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.label.property.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted --label-file-is-presorted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the claims with Wikidata properties in the node2 column." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n", " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n", " -p ';; ^P.*$' -o $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count per Wikidata property datatype the number of claims with Wikidata properties in the node2 column." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node2.property.datatypes.$SORTED_EXTENSION \\\n", " --column 'node2;wikidatatype'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the properties for claims with Wikidata properties in the node2 column and lift the English label for each property." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "!$kgtk unique --verbose=$VERBOSE \\\n", " --use-mgzip $USE_MGZIP \\\n", " --input-file $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \\\n", " --column label \\\n", " --label node2-property-count \\\n", "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n", " --label-file $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n", " --output-file $WIKIDATA_PARTS/claims.node2.property.counts.$SORTED_EXTENSION \\\n", " --columns-to-lift node1 \\\n", " --input-file-is-presorted --label-file-is-presorted" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }