{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Calculating Pagerank on Wikidata" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import os" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: MY=/Users/pedroszekely/data/wikidata-20200504\n", "env: WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504\n" ] } ], "source": [ "%env MY=/Users/pedroszekely/data/wikidata-20200504\n", "%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to filter the wikidata edge file to remove all edges where `node2` is a literal. \n", "We can do this by running `ifexists` to keep edges where `node2` also appears in `node1`.\n", "This takes 2-3 hours on a laptop." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "real\t121m58.689s\n", "user\t129m53.195s\n", "sys\t6m21.092s\n" ] } ], "source": [ "!time gzcat \"$WD/wikidata_edges_20200504.tsv.gz\" \\\n", " | kgtk ifexists --filter-on \"$WD/wikidata_edges_20200504.tsv.gz\" --input-keys node2 --filter-keys node1 \\\n", " | gzip > \"$MY/wikidata-item-edges.tsv.gz\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 460763981 3225347876 32869769062\n" ] } ], "source": [ "!gzcat $MY/wikidata-item-edges.tsv.gz | wc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 460 million edges that connect items to other items, let's make sure this is what we want before spending a lot of time computing pagerank" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id\tnode1\tlabel\tnode2\trank\tnode2;magnitude\tnode2;unit\tnode2;date\tnode2;item\tnode2;lower\tnode2;upper\tnode2;latitude\tnode2;longitude\tnode2;precision\tnode2;calendar\tnode2;entity-type\n", "Q8-P31-1\tQ8\tP31\tQ331769\tnormal\t\t\t\tQ331769\t\t\t\t\t\t\titem\n", "Q8-P31-2\tQ8\tP31\tQ60539479\tnormal\t\t\t\tQ60539479\t\t\t\t\t\t\titem\n", "Q8-P31-3\tQ8\tP31\tQ9415\tnormal\t\t\t\tQ9415\t\t\t\t\t\t\titem\n", "Q8-P1343-1\tQ8\tP1343\tQ20743760\tnormal\t\t\t\tQ20743760\t\t\t\t\t\t\titem\n", "Q8-P1343-2\tQ8\tP1343\tQ1970746\tnormal\t\t\t\tQ1970746\t\t\t\t\t\t\titem\n", "Q8-P1343-3\tQ8\tP1343\tQ19180675\tnormal\t\t\t\tQ19180675\t\t\t\t\t\t\titem\n", "Q8-P461-1\tQ8\tP461\tQ169251\tnormal\t\t\t\tQ169251\t\t\t\t\t\t\titem\n", "Q8-P279-1\tQ8\tP279\tQ16748867\tnormal\t\t\t\tQ16748867\t\t\t\t\t\t\titem\n", "Q8-P460-1\tQ8\tP460\tQ935526\tnormal\t\t\t\tQ935526\t\t\t\t\t\t\titem\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $MY/wikidata-item-edges.tsv.gz | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a sanity check to make sure that we have the edges that we want.\n", "We can do this by counting how many edges of each `entity-type`. \n", "Good news, we only have items and properties." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "item\tcount\t460737401\n", "property\tcount\t26579\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges.tsv.gz: uncompress failed\n", "\n", "real\t21m44.450s\n", "user\t21m29.078s\n", "sys\t0m7.958s\n" ] } ], "source": [ "!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk unique $MY/wikidata-item-edges.tsv.gz --column 'node2;entity-type'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only needd `node`, `label` and `node2`, so let's remove the other columns" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "real\t35m11.023s\n", "user\t56m9.951s\n", "sys\t2m37.521s\n" ] } ], "source": [ "!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk remove-columns -c 'id,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type' \\\n", " | gzip > $MY/wikidata-item-edges-only.tsv.gz" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\n", "Q8\tP31\tQ331769\n", "Q8\tP31\tQ60539479\n", "Q8\tP31\tQ9415\n", "Q8\tP1343\tQ20743760\n", "Q8\tP1343\tQ1970746\n", "Q8\tP1343\tQ19180675\n", "Q8\tP461\tQ169251\n", "Q8\tP279\tQ16748867\n", "Q8\tP460\tQ935526\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges-only.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $MY/wikidata-item-edges-only.tsv.gz | head" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "!gunzip $MY/wikidata-item-edges-only.tsv.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `kgtk graph-statistics` command will compute pagerank. It will run out of memory on a laptop with 16GB of memory." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/sh: line 1: 89795 Killed: 9 kgtk graph-statistics --directed --degrees --pagerank --log $MY/log.txt -i $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv\n", "\n", "real\t32m57.832s\n", "user\t19m47.624s\n", "sys\t8m58.352s\n" ] } ], "source": [ "!time kgtk graph_statistics --directed --degrees --pagerank --log $MY/log.txt -i $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We ran it on a server with 256GM of memory. It used 50GB and produced the following files:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".\u001b[1;33mr\u001b[31mw\u001b[0m\u001b[38;5;244m-------\u001b[0m \u001b[1;32m735\u001b[0m\u001b[32mM\u001b[0m \u001b[1;33mpedroszekely\u001b[0m \u001b[34m 4 Jun 16:21\u001b[0m \u001b[36m/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/\u001b[31mwikidata-in-degree-only-sorted.tsv.gz\u001b[0m\n", ".\u001b[1;33mr\u001b[31mw\u001b[0m\u001b[38;5;244m-------\u001b[0m \u001b[1;32m764\u001b[0m\u001b[32mM\u001b[0m \u001b[1;33mpedroszekely\u001b[0m \u001b[34m 4 Jun 16:19\u001b[0m \u001b[36m/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/\u001b[31mwikidata-out-degree-only-sorted.tsv.gz\u001b[0m\n", ".\u001b[1;33mr\u001b[31mw\u001b[0m\u001b[38;5;244m-------\u001b[0m@ \u001b[1;32m928\u001b[0m\u001b[32mM\u001b[0m \u001b[1;33mpedroszekely\u001b[0m \u001b[34m 5 Jun 0:21\u001b[0m \u001b[36m/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/\u001b[31mwikidata-pagerank-only-sorted.tsv.gz\u001b[0m\n" ] } ], "source": [ "!exa -l \"$WD\"/*sorted*" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tproperty\tnode2\tid\n", "Q13442814\tvertex_pagerank\t0.02422254325848587\tQ13442814-vertex_pagerank-881612\n", "Q1860\tvertex_pagerank\t0.00842243515354162\tQ1860-vertex_pagerank-140\n", "Q5\tvertex_pagerank\t0.0073505352600377934\tQ5-vertex_pagerank-188\n", "Q5633421\tvertex_pagerank\t0.005898322426631837\tQ5633421-vertex_pagerank-101732\n", "Q21502402\tvertex_pagerank\t0.005796874633668408\tQ21502402-vertex_pagerank-4838249\n", "Q54812269\tvertex_pagerank\t0.005117345954282296\tQ54812269-vertex_pagerank-4838258\n", "Q1264450\tvertex_pagerank\t0.004881314896960181\tQ1264450-vertex_pagerank-18326\n", "Q602358\tvertex_pagerank\t0.004546331287981006\tQ602358-vertex_pagerank-587\n", "Q53869507\tvertex_pagerank\t0.0038679964665001417\tQ53869507-vertex_pagerank-3160055\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-pagerank-only-sorted.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat \"$WD/wikidata-pagerank-only-sorted.tsv.gz\" | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh, the `graph_statistics` command is not using standard column naming, using `property` instead of `label`.\n", "This will be fixed, for now, let's rename the columns." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "!kgtk rename-col -i \"$WD/wikidata-pagerank-only-sorted.tsv.gz\" --mode NONE --output-columns node1 label node2 id | gzip > $MY/wikidata-pagerank-only-sorted.tsv.gz" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\n", "Q13442814\tvertex_pagerank\t0.02422254325848587\tQ13442814-vertex_pagerank-881612\n", "Q1860\tvertex_pagerank\t0.00842243515354162\tQ1860-vertex_pagerank-140\n", "Q5\tvertex_pagerank\t0.0073505352600377934\tQ5-vertex_pagerank-188\n", "Q5633421\tvertex_pagerank\t0.005898322426631837\tQ5633421-vertex_pagerank-101732\n", "Q21502402\tvertex_pagerank\t0.005796874633668408\tQ21502402-vertex_pagerank-4838249\n", "Q54812269\tvertex_pagerank\t0.005117345954282296\tQ54812269-vertex_pagerank-4838258\n", "Q1264450\tvertex_pagerank\t0.004881314896960181\tQ1264450-vertex_pagerank-18326\n", "Q602358\tvertex_pagerank\t0.004546331287981006\tQ602358-vertex_pagerank-587\n", "Q53869507\tvertex_pagerank\t0.0038679964665001417\tQ53869507-vertex_pagerank-3160055\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-pagerank-only-sorted.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat $MY/wikidata-pagerank-only-sorted.tsv.gz | head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put the labels on the entity labels as columns so that we can read what is what. To do that, we concatenate the pagerank file with the labels file, and then ask kgtk to lift the labels into new columns." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "real\t10m55.396s\n", "user\t16m15.752s\n", "sys\t0m17.351s\n" ] } ], "source": [ "!time kgtk cat -i \"$MY/wikidata_labels.tsv\" $MY/pagerank.tsv | gzip > $MY/pagerank-and-labels.tsv.gz" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "real\t32m37.811s\n", "user\t11m5.594s\n", "sys\t10m30.283s\n" ] } ], "source": [ "!time kgtk lift -i $MY/pagerank-and-labels.tsv.gz | gzip > \"$WD/wikidata-pagerank-en.tsv.gz\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at the labels. Here are the top 20 pagerank items in Wikidata:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tid\tnode1;label\tlabel;label\tnode2;label\n", "Q13442814\tvertex_pagerank\t0.02422254325848587\tQ13442814-vertex_pagerank-881612\t'scholarly article'@en\t\t\n", "Q1860\tvertex_pagerank\t0.00842243515354162\tQ1860-vertex_pagerank-140\t'English'@en\t\t\n", "Q5\tvertex_pagerank\t0.0073505352600377934\tQ5-vertex_pagerank-188\t'human'@en\t\t\n", "Q5633421\tvertex_pagerank\t0.005898322426631837\tQ5633421-vertex_pagerank-101732\t'scientific journal'@en\t\t\n", "Q21502402\tvertex_pagerank\t0.005796874633668408\tQ21502402-vertex_pagerank-4838249\t'property constraint'@en\t\t\n", "Q54812269\tvertex_pagerank\t0.005117345954282296\tQ54812269-vertex_pagerank-4838258\t'WikibaseQualityConstraints'@en\t\t\n", "Q1264450\tvertex_pagerank\t0.004881314896960181\tQ1264450-vertex_pagerank-18326\t'J2000.0'@en\t\t\n", "Q602358\tvertex_pagerank\t0.004546331287981006\tQ602358-vertex_pagerank-587\t'Brockhaus and Efron Encyclopedic Dictionary'@en\t\t\n", "Q53869507\tvertex_pagerank\t0.0038679964665001417\tQ53869507-vertex_pagerank-3160055\t'property scope constraint'@en\t\t\n", "Q30\tvertex_pagerank\t0.003722615192558219\tQ30-vertex_pagerank-53\t'United States of America'@en\t\t\n", "Q2657718\tvertex_pagerank\t0.0036754039394037105\tQ2657718-vertex_pagerank-2969\t'Armenian Soviet Encyclopedia'@en\t\t\n", "Q21503250\tvertex_pagerank\t0.0036258228083834655\tQ21503250-vertex_pagerank-1652825\t'type constraint'@en\t\t\n", "Q19902884\tvertex_pagerank\t0.003403993346207395\tQ19902884-vertex_pagerank-4843313\t'Wikidata property definition'@en\t\t\n", "Q6581097\tvertex_pagerank\t0.0030890199307556172\tQ6581097-vertex_pagerank-128\t'male'@en\t\t\n", "Q21510865\tvertex_pagerank\t0.0029815432838705648\tQ21510865-vertex_pagerank-1652828\t'value type constraint'@en\t\t\n", "P2302\tvertex_pagerank\t0.0028243647567065384\tP2302-vertex_pagerank-20767739\t'property constraint'@en\t\t\n", "Q16521\tvertex_pagerank\t0.0028099172909745035\tQ16521-vertex_pagerank-794\t'taxon'@en\t\t\n", "Q21502838\tvertex_pagerank\t0.0027485333861137183\tQ21502838-vertex_pagerank-1652816\t'conflicts-with constraint'@en\t\t\n", "Q19652\tvertex_pagerank\t0.0026895742122130316\tQ19652-vertex_pagerank-3428\t'public domain'@en\t\t\n", "gzcat: error writing to output: Broken pipe\n", "gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-pagerank-en.tsv.gz: uncompress failed\n" ] } ], "source": [ "!gzcat \"$WD/wikidata-pagerank-en.tsv.gz\" | head -20" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }