# Calculating Pagerank on Wikidata

In [23]:
import numpy as np
import pandas as pd
import os

In [24]:
%env MY=/Users/pedroszekely/data/wikidata-20200504
%env WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504

env: MY=/Users/pedroszekely/data/wikidata-20200504
env: WD=/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504


We need to filter the Wikidata edge file to remove all edges where `node2` is a literal.
We can do this by running `ifexists` to keep only the edges whose `node2` value also appears as a `node1` value somewhere in the file.
This takes 2-3 hours on a laptop.

In [4]:
!time gzcat "$WD/wikidata_edges_20200504.tsv.gz" \
 | kgtk ifexists --filter-on "$WD/wikidata_edges_20200504.tsv.gz" --input-keys node2 --filter-keys node1 \
 | gzip > "$MY/wikidata-item-edges.tsv.gz"


real	121m58.689s
user	129m53.195s
sys	6m21.092s
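The filter above can be sketched in plain Python to make its semantics concrete. This is only an illustration on a made-up four-edge file, not how `kgtk ifexists` is implemented; the helper names `read_edges` and `keep_item_edges` are hypothetical.

```python
import csv
import io

# Toy edge file in (node1, label, node2) order; the rows are made up.
EDGES_TSV = (
    "node1\tlabel\tnode2\n"
    "Q1\tP31\tQ2\n"
    "Q1\tP1476\t'some title'@en\n"
    "Q2\tP279\tQ3\n"
    "Q3\tP31\tQ1\n"
)


def read_edges(text):
    """Parse tab-separated edge text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def keep_item_edges(edges):
    """Keep only the edges whose node2 also occurs somewhere as a node1."""
    subjects = {edge["node1"] for edge in edges}
    return [edge for edge in edges if edge["node2"] in subjects]


edges = read_edges(EDGES_TSV)
item_edges = keep_item_edges(edges)
for edge in item_edges:
    print(edge["node1"], edge["label"], edge["node2"])
```

The edge whose `node2` is the string literal is dropped because that value never appears as a `node1`. The same rule would also drop an edge pointing at an item that has no outgoing edges of its own, which is worth keeping in mind when interpreting the result.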


In [5]:
!gzcat $MY/wikidata-item-edges.tsv.gz | wc

 460763981 3225347876 32869769062


We have about 460 million edges that connect items to other items. Before spending a lot of time computing pagerank, let's make sure this is what we want.

In [6]:
!gzcat $MY/wikidata-item-edges.tsv.gz | head

id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type
Q8-P31-1	Q8	P31	Q331769	normal				Q331769							item
Q8-P31-2	Q8	P31	Q60539479	normal				Q60539479							item
Q8-P31-3	Q8	P31	Q9415	normal				Q9415							item
Q8-P1343-1	Q8	P1343	Q20743760	normal				Q20743760							item
Q8-P1343-2	Q8	P1343	Q1970746	normal				Q1970746							item
Q8-P1343-3	Q8	P1343	Q19180675	normal				Q19180675							item
Q8-P461-1	Q8	P461	Q169251	normal				Q169251							item
Q8-P279-1	Q8	P279	Q16748867	normal				Q16748867							item
Q8-P460-1	Q8	P460	Q935526	normal				Q935526							item
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges.tsv.gz: uncompress failed


Let's do a sanity check to make sure that we have the edges that we want.
We can do this by counting the edges for each value of `node2;entity-type`.
Good news: we only have items and properties.

In [7]:
!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk unique $MY/wikidata-item-edges.tsv.gz --column 'node2;entity-type'

node1	label	node2
item	count	460737401
property	count	26579
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges.tsv.gz: uncompress failed

real	21m44.450s
user	21m29.078s
sys	0m7.958s
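For readers who want to reproduce this count without KGTK, a minimal Python sketch of the same idea follows. It tallies the values of one column with `collections.Counter`; the sample rows and the helper name `count_column_values` are illustrative assumptions.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample with the column that the sanity check inspects.
SAMPLE = (
    "node1\tlabel\tnode2\tnode2;entity-type\n"
    "Q8\tP31\tQ331769\titem\n"
    "Q8\tP279\tQ16748867\titem\n"
    "P31\tP1659\tP279\tproperty\n"
)


def count_column_values(lines, column):
    """Count how often each distinct value occurs in the given column."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


counts = count_column_values(io.StringIO(SAMPLE), "node2;entity-type")
for value, count in counts.most_common():
    print(value, count)
```

For the real file, the same function could be fed a text stream opened with `gzip.open(path, "rt", encoding="utf-8", newline="")`, so the data is read one row at a time rather than loaded into memory.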


We only need `node1`, `label` and `node2`, so let's remove the other columns.

In [8]:
!time gzcat $MY/wikidata-item-edges.tsv.gz | kgtk remove-columns -c 'id,rank,node2;magnitude,node2;unit,node2;date,node2;item,node2;lower,node2;upper,node2;latitude,node2;longitude,node2;precision,node2;calendar,node2;entity-type' \
 | gzip > $MY/wikidata-item-edges-only.tsv.gz


real	35m11.023s
user	56m9.951s
sys	2m37.521s
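The column removal can also be expressed as a projection onto the columns we want to keep. The sketch below streams lines and keeps `node1`, `label` and `node2`; it is an illustration with a hypothetical helper `project_columns`, not a replacement for `kgtk remove-columns`.

```python
def project_columns(lines, keep):
    """Yield tab-separated lines that contain only the columns named in keep."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # empty input, nothing to project
    header = header_line.rstrip("\n").split("\t")
    positions = [header.index(name) for name in keep]
    yield "\t".join(keep)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i] for i in positions)


# Two lines shaped like the edge file, with fewer qualifier columns for brevity.
sample = [
    "id\tnode1\tlabel\tnode2\trank\tnode2;entity-type\n",
    "Q8-P31-1\tQ8\tP31\tQ331769\tnormal\titem\n",
]
projected = list(project_columns(sample, ["node1", "label", "node2"]))
print("\n".join(projected))
```

Naming the columns to keep is shorter than listing every column to drop, and it keeps working if extra qualifier columns appear in the input.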


In [9]:
!gzcat $MY/wikidata-item-edges-only.tsv.gz | head

node1	label	node2
Q8	P31	Q331769
Q8	P31	Q60539479
Q8	P31	Q9415
Q8	P1343	Q20743760
Q8	P1343	Q1970746
Q8	P1343	Q19180675
Q8	P461	Q169251
Q8	P279	Q16748867
Q8	P460	Q935526
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-item-edges-only.tsv.gz: uncompress failed


In [44]:
!gunzip $MY/wikidata-item-edges-only.tsv.gz

The `kgtk graph-statistics` command will compute pagerank. It will run out of memory on a laptop with 16GB of memory.

In [12]:
!time kgtk graph_statistics --directed --degrees --pagerank --log $MY/log.txt -i $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv

/bin/sh: line 1: 89795 Killed: 9 kgtk graph-statistics --directed --degrees --pagerank --log $MY/log.txt -i $MY/wikidata-item-edges-only.tsv > $MY/wikidata-pagerank-degrees.tsv

real	32m57.832s
user	19m47.624s
sys	8m58.352s
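To make clear what is being computed, here is a small power-iteration sketch of pagerank on a toy directed graph. It is a generic textbook formulation under assumed settings (damping factor 0.85, rank of nodes without outgoing edges spread uniformly); it ignores edge labels and is not a description of how KGTK computes the values. The function name `pagerank` and the toy edges are illustrative.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Return a dict mapping each node to its pagerank score (scores sum to 1)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Made-up graph: three nodes point at Q5, so Q5 should rank highest.
toy_edges = [("Q1", "Q5"), ("Q2", "Q5"), ("Q3", "Q5"), ("Q3", "Q1")]
scores = pagerank(toy_edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Even this sketch keeps the whole adjacency structure in memory, which hints at why a graph with hundreds of millions of edges does not fit on a 16GB laptop.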


We ran it on a server with 256GB of memory. It used 50GB and produced the following files:

In [21]:
!exa -l "$WD"/*sorted*

.rw------- 735M pedroszekely  4 Jun 16:21 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-in-degree-only-sorted.tsv.gz
.rw------- 764M pedroszekely  4 Jun 16:19 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-out-degree-only-sorted.tsv.gz
.rw-------@ 928M pedroszekely  5 Jun 0:21 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-pagerank-only-sorted.tsv.gz


In [26]:
!gzcat "$WD/wikidata-pagerank-only-sorted.tsv.gz" | head

node1	property	node2	id
Q13442814	vertex_pagerank	0.02422254325848587	Q13442814-vertex_pagerank-881612
Q1860	vertex_pagerank	0.00842243515354162	Q1860-vertex_pagerank-140
Q5	vertex_pagerank	0.0073505352600377934	Q5-vertex_pagerank-188
Q5633421	vertex_pagerank	0.005898322426631837	Q5633421-vertex_pagerank-101732
Q21502402	vertex_pagerank	0.005796874633668408	Q21502402-vertex_pagerank-4838249
Q54812269	vertex_pagerank	0.005117345954282296	Q54812269-vertex_pagerank-4838258
Q1264450	vertex_pagerank	0.004881314896960181	Q1264450-vertex_pagerank-18326
Q602358	vertex_pagerank	0.004546331287981006	Q602358-vertex_pagerank-587
Q53869507	vertex_pagerank	0.0038679964665001417	Q53869507-vertex_pagerank-3160055
gzcat: error writing to output: Broken pipe
gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200504/wikidata-pagerank-only-sorted.tsv.gz: uncompress failed


Note that the `graph_statistics` command does not use the standard column names: it writes `property` where KGTK files normally use `label`.
This will be fixed; for now, let's rename the columns.

In [14]:
!kgtk rename-col -i "$WD/wikidata-pagerank-only-sorted.tsv.gz" --mode NONE --output-columns node1 label node2 id | gzip > $MY/wikidata-pagerank-only-sorted.tsv.gz
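Since only the header line changes, the rename amounts to swapping the first line of the file and copying the rest. A minimal sketch, assuming the new header has the same number of columns as the old one; `rename_header` is a hypothetical helper, not a KGTK function.

```python
def rename_header(lines, new_columns):
    """Yield the input lines with the header replaced by new_columns."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # empty input, nothing to rename
    old_columns = header_line.rstrip("\n").split("\t")
    if len(old_columns) != len(new_columns):
        raise ValueError("new header must have the same number of columns")
    yield "\t".join(new_columns)
    for line in lines:
        yield line.rstrip("\n")


sample = [
    "node1\tproperty\tnode2\tid\n",
    "Q1860\tvertex_pagerank\t0.00842243515354162\tQ1860-vertex_pagerank-140\n",
]
renamed = list(rename_header(sample, ["node1", "label", "node2", "id"]))
print(renamed[0])
```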

In [16]:
!gzcat $MY/wikidata-pagerank-only-sorted.tsv.gz | head

node1	label	node2	id
Q13442814	vertex_pagerank	0.02422254325848587	Q13442814-vertex_pagerank-881612
Q1860	vertex_pagerank	0.00842243515354162	Q1860-vertex_pagerank-140
Q5	vertex_pagerank	0.0073505352600377934	Q5-vertex_pagerank-188
Q5633421	vertex_pagerank	0.005898322426631837	Q5633421-vertex_pagerank-101732
Q21502402	vertex_pagerank	0.005796874633668408	Q21502402-vertex_pagerank-4838249
Q54812269	vertex_pagerank	0.005117345954282296	Q54812269-vertex_pagerank-4838258
Q1264450	vertex_pagerank	0.004881314896960181	Q1264450-vertex_pagerank-18326
Q602358	vertex_pagerank	0.004546331287981006	Q602358-vertex_pagerank-587
Q53869507	vertex_pagerank	0.0038679964665001417	Q53869507-vertex_pagerank-3160055
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/data/wikidata-20200504/wikidata-pagerank-only-sorted.tsv.gz: uncompress failed


Let's add the entity labels as extra columns so that we can read what is what. To do that, we concatenate the pagerank file with the labels file, and then ask kgtk to lift the labels into new columns.

In [17]:
!time kgtk cat -i "$MY/wikidata_labels.tsv" $MY/pagerank.tsv | gzip > $MY/pagerank-and-labels.tsv.gz


real	10m55.396s
user	16m15.752s
sys	0m17.351s


In [18]:
!time kgtk lift -i $MY/pagerank-and-labels.tsv.gz | gzip > "$WD/wikidata-pagerank-en.tsv.gz"


real	32m37.811s
user	11m5.594s
sys	10m30.283s
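Conceptually, lifting means looking up the label of each `node1`, `label` and `node2` value and writing it into the `node1;label`, `label;label` and `node2;label` columns. The sketch below shows that idea on two edges. It assumes label edges use the property name `label` and it drops them from the output; both are assumptions of this illustration rather than statements about the defaults of `kgtk lift`, and `lift_labels` is a hypothetical helper.

```python
def lift_labels(edges, label_property="label"):
    """Attach labels to edges as extra columns.

    edges is a list of (node1, label, node2) tuples. Edges whose property is
    label_property supply the labels; all other edges are returned with three
    extra fields: node1;label, label;label and node2;label.
    """
    labels = {n1: n2 for n1, prop, n2 in edges if prop == label_property}
    lifted = []
    for n1, prop, n2 in edges:
        if prop == label_property:
            continue  # label edges only feed the lookup table in this sketch
        lifted.append(
            (n1, prop, n2, labels.get(n1, ""), labels.get(prop, ""), labels.get(n2, ""))
        )
    return lifted


combined = [
    ("Q5", "label", "'human'@en"),
    ("Q5", "vertex_pagerank", "0.0073505352600377934"),
]
lifted = lift_labels(combined)
for row in lifted:
    print("\t".join(row))
```

Values without a label, such as the pagerank number in `node2`, simply get an empty cell in the corresponding label column.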


Now we can look at the labels. Here are the items with the highest pagerank in Wikidata:

In [28]:
!gzcat "$WD/wikidata-pagerank-en.tsv.gz" | head -20

node1	label	node2	id	node1;label	label;label	node2;label
Q13442814	vertex_pagerank	0.02422254325848587	Q13442814-vertex_pagerank-881612	'scholarly article'@en		
Q1860	vertex_pagerank	0.00842243515354162	Q1860-vertex_pagerank-140	'English'@en		
Q5	vertex_pagerank	0.0073505352600377934	Q5-vertex_pagerank-188	'human'@en		
Q5633421	vertex_pagerank	0.005898322426631837	Q5633421-vertex_pagerank-101732	'scientific journal'@en		
Q21502402	vertex_pagerank	0.005796874633668408	Q21502402-vertex_pagerank-4838249	'property constraint'@en		
Q54812269	vertex_pagerank	0.005117345954282296	Q54812269-vertex_pagerank-4838258	'WikibaseQualityConstraints'@en		
Q1264450	vertex_pagerank	0.004881314896960181	Q1264450-vertex_pagerank-18326	'J2000.0'@en		
Q602358	vertex_pagerank	0.004546331287981006	Q602358-vertex_pagerank-587	'Brockhaus and Efron Encyclopedic Dictionary'@en		
Q53869507	vertex_pagerank	0.0038679964665001417	Q53869507-vertex_pagerank-3160055	'property scope constraint'@en		
Q30	vertex_pagerank	0