{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating a subset of Wikidata\n",
    "\n",
    "This notebook illustrates counting the properties in a partitioned Wikidata KGTK edges file.\n",
    "\n",
    "Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\n",
    "\n",
    "```\n",
    "papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\n",
    "-p wikidata_parts_path /data4/rogers/elicit/cache/datasets/wikidata-20200803/parts \\\n",
    "```\n",
    "\n",
    "Here are some contraints on the contents of the input files:\n",
    "- The input file starts with a KGTK header record.\n",
    "  - In addition to the `id`, `node1`, `label`, and `node2` columns, the file is expected contain `rank`, `node2;wikidatatype`, and `lang` columns.\n",
    "  - The `rank` column is not used in this script.\n",
    "  - The `node2;wikidatatype` column is used to partion claims by Wikidata property datatype.\n",
    "  - The `lang` column is used to extract English language sitelinks.\n",
    "- The `id` column must contain a nonempty value.\n",
    "  - It must follow certain patterns for claim and qualifier records.\n",
    "    - Claim records contain 5 sections separated by hyphens (4 hyphens total).\n",
    "    - Qualifier records contain 8 sections separated by dashes (7 dashes total).\n",
    "- The first section of an `id` value must be the `node` value for the record.\n",
    "  - The qualifier extraction operations depend upon this constraint. \n",
    "- In addition to the claims and qualifiers, the input file is expected to contain:\n",
    "  - English language labels for all property entities appearing in the file.\n"
    ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters for invoking the notebook\n",
    "\n",
    "| Parameter | Description | Default |\n",
    "| --------- | ----------- | ------- |\n",
    "| `wikidata_parts_path` | A folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |\n",
    "| `temp_folder_path` |    A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |\n",
    "| `gzip_command` |        The compression command for sorting. | 'pigz' |\n",
    "| `sort_extras` |         Extra parameters for the sort program.  The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--temporary-directory ' + wikidata_parts_path |\n",
    "| `unsorted_extension` |  The file extension for unsorted files. | 'unsorted.tsv.gz' |\n",
    "| `sorted_extension` |    The file extension for sorted files. | 'tsv.gz' |\n",
    "| `use_mgzip` |           When True, use the mgzip program where appropriate for faster compression. | 'True' |\n",
    "| `verbose` |             When True, produce additional feedback messages. | 'True' |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "wikidata_parts_path = '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts2'\n",
    "temp_folder_path =    wikidata_parts_path + '/temp'\n",
    "gzip_command =        'pigz'\n",
    "sort_extras =         '--temporary-directory ' + wikidata_parts_path\n",
    "unsorted_extension =  'unsorted.tsv.gz'\n",
    "sorted_extension =    'tsv.gz'\n",
    "use_mgzip =           'True'\n",
    "verbose =             'True'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the Python modules we will use in this script.\n",
    "Almost all of this script consists of shell commands, so all we need to import is `os`, which we use for setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up environment variables and folders that we need\n",
    "Define environment variables to pass the script parameters to the KGTK commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# folder with partitioned Wikidata files.\n",
    "os.environ['WIKIDATA_PARTS'] =     wikidata_parts_path\n",
    "# temporary folder\n",
    "os.environ['TEMP'] =               temp_folder_path\n",
    "# kgtk command to run\n",
    "# os.environ['kgtk'] =             \"kgtk\"\n",
    "os.environ['kgtk'] =               \"time kgtk --debug --timing\"\n",
    "# gzip command to run\n",
    "os.environ['gzip'] =               gzip_command\n",
    "# extra parameters for sort\n",
    "os.environ['SORT_EXTRAS'] =        sort_extras\n",
    "# The unsorted file extension.\n",
    "os.environ['UNSORTED_EXTENSION'] = unsorted_extension\n",
    "# The sorted file extension.\n",
    "os.environ['SORTED_EXTENSION'] =   sorted_extension\n",
    "# The use_mgzip flag.\n",
    "os.environ['USE_MGZIP'] =          use_mgzip\n",
    "# The verbose flag.\n",
    "os.environ['VERBOSE'] =            verbose\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the Claims Entity list\n",
    "Create `claims.node1.entity.counts`.  This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node1` column of the claim file, along with the matching English language labels. Wikidata items have `entityId` values that start with `Q`, while Wikidata properties have `entityId` values that start with `P`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " --column      node1 \\\n",
    " --label node1-entity-count \\\n",
    "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n",
    " --label-file  $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.node1.entity.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift node1 \\\n",
    " --input-file-is-presorted \\\n",
    " --label-file-is-presorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create `claims.label.entity.counts`.  This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `label` column of the claim file, along with English language labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " --column      label \\\n",
    " --label label-entity-count \\\n",
    "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n",
    " --label-file  $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.label.entity.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift node1 \\\n",
    " --input-file-is-presorted \\\n",
    " --label-file-is-presorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create `claims.node2.entity.counts`.  This is a KGTK edge file that contains a count of all the Wikidata `entityId` values in the `node2` column of the claim file, along with English language labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\nb",
    " --input-file  $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " -p ';; ^[PQ].*$' -o - \\\n",
    "/ unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --column      node2 \\\n",
    " --label node2-entity-count \\\n",
    "/ lift --verbose=${VERBOSE} --use-mgzip=$USE_MGZIP \\\n",
    " --label-file  $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.node2.entity.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift node1 \\\n",
    " --input-file-is-presorted \\\n",
    " --label-file-is-presorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Count the number of claims per Wikidata datatype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.datatypes.$SORTED_EXTENSION \\\n",
    " --column      'node2;wikidatatype'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the Property claims\n",
    "Extract the claims with Wikidata properties in the node1 column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n",
    " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " -p '^P.*$ ;;' -o $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count per Wikidata property datatype the number of claims with Wikidata properties in the node1 column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.node1.property.datatypes.$SORTED_EXTENSION \\\n",
    " --column      'node2;wikidatatype'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the properties for claims with Wikidata properties in the node1 column and lift the English label for each property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE \\\n",
    " --use-mgzip $USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.node1.property.rows.$SORTED_EXTENSION \\\n",
    " --column      label \\\n",
    " --label       node1-property-count \\\n",
    "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --label-file       $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file      $WIKIDATA_PARTS/claims.node1.property.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift  node1 \\\n",
    " --input-file-is-presorted --label-file-is-presorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extract the claims with Wikidata properties in the label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n",
    " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " -p '; ^P.*$ ;' -o $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count per Wikidata property datatype the number of claims with Wikidata properties in the label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.label.property.datatypes.$SORTED_EXTENSION \\\n",
    " --column      'node2;wikidatatype'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the properties for claims with Wikidata properties in the label column and lift the English label for each property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE \\\n",
    " --use-mgzip $USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.label.property.rows.$SORTED_EXTENSION \\\n",
    " --column      label \\\n",
    " --label       label-property-count \\\n",
    "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --label-file       $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file      $WIKIDATA_PARTS/claims.label.property.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift  node1 \\\n",
    " --input-file-is-presorted --label-file-is-presorted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extract the claims with Wikidata properties in the node2 column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \\\n",
    " --input-file $WIKIDATA_PARTS/claims.$SORTED_EXTENSION \\\n",
    " -p ';; ^P.*$' -o $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count per Wikidata property datatype the number of claims with Wikidata properties in the label column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \\\n",
    " --output-file $WIKIDATA_PARTS/claims.node2.property.datatypes.$SORTED_EXTENSION \\\n",
    " --column      'node2;wikidatatype'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count the properties for claims with Wikidata properties in the label column and lift the English label for each property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "!$kgtk unique --verbose=$VERBOSE \\\n",
    " --use-mgzip $USE_MGZIP \\\n",
    " --input-file  $WIKIDATA_PARTS/claims.node2.property.rows.$SORTED_EXTENSION \\\n",
    " --column      label \\\n",
    " --label       node2-property-count \\\n",
    "/ lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \\\n",
    " --label-file       $WIKIDATA_PARTS/labels.en.$SORTED_EXTENSION \\\n",
    " --output-file      $WIKIDATA_PARTS/claims.node2.property.counts.$SORTED_EXTENSION \\\n",
    " --columns-to-lift  node1 \\\n",
    " --input-file-is-presorted --label-file-is-presorted"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}