{
"cells": [
{
"cell_type": "markdown",
"id": "statutory-onion",
"metadata": {},
"source": [
"# Understanding Removed Statements Dataset\n",
"\n",
"Source of data: [GDrive | Removed Statements of Wikidata | Feb 1 2021](https://drive.google.com/file/d/1TQP1rADdvhDjsvBpLzSE9Bx3n73wf-Md/view?usp=sharing)\n",
"\n",
"Steps performed:\n",
"* Divide the dataset into two parts: redirected and non-redirected. In the redirected part, node1 or node2 (or both) is a redirected node. In the non-redirected part, neither node1 nor node2 is redirected.\n",
"\n",
"\n",
"**Summary**\n",
"\n",
"Removed Statements dataset has 76.5M removed statements. Out of these, "
]
},
{
"cell_type": "markdown",
"id": "christian-mounting",
"metadata": {},
"source": [
"## Determining redirects and dividing the dataset into two parts\n",
"\n",
"* Since no redirects dataset was available, a SPARQL query was run on Feb 19, 2021 against the [Wikidata Query Service](https://query.wikidata.org/) to find all redirects existing at that time. The query was:\n",
" ```\n",
" SELECT ?old_node\n",
" WHERE {\n",
" ?old_node owl:sameAs ?new_node.\n",
" }\n",
" ```\n",
"* The result also contains a few lexemes, which we don't need. To identify them, I then ran the query:\n",
" ```\n",
" SELECT ?old_node\n",
" WHERE {\n",
" ?old_node owl:sameAs ?new_node.\n",
" ?new_node rdf:type ontolex:LexicalEntry.\n",
" }\n",
" ```\n",
"* After removing the lexemes from the nodes file, a final file of redirected non-lexeme nodes was created with data from Feb 19, 2021: `../../data/SPARQL_redirects_non-lexemes.tsv`.\n",
"* Using this reduced file, I determined which statements in `removed_statements.tsv` involve redirected nodes: `../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv` contains the removed statements in which either node1 or node2 is redirected.\n",
"* Finally, I extracted the removed statements not present in this subset, i.e. all removed statements in which neither node1 nor node2 is redirected: `../../opAnalysis/removed_statements_both_nonredirects.tsv`.\n",
"\n",
"For this, I used the following set of commands:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "thick-absorption",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "boolean-string",
"metadata": {},
"outputs": [],
"source": [
"# Split the removed statements using the redirect list obtained via SPARQL\n",
"\n",
"# Removed statements whose node1 is a redirected node\n",
"!kgtk ifexists -i ../../data/removed_statements.tsv \\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node1.tsv\n",
"# Removed statements whose node1 is not a redirected node\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_basis_node1.tsv\n",
"# Removed statements whose node2 is a redirected node\n",
"!kgtk ifexists -i ../../data/removed_statements.tsv \\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node2.tsv\n",
"# Removed statements whose node2 is not a redirected node\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_basis_node2.tsv\n",
"# Union of the two redirected sets: keep the node1 matches that are not\n",
"# already among the node2 matches, then append the node2 matches\n",
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n",
" -o ../../opAnalysis/temp1.tsv\n",
"!kgtk cat -i ../../opAnalysis/temp1.tsv \\\n",
" ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv\n",
"# Everything else: removed statements in which neither node1 nor node2 is redirected\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects.tsv"
]
},
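{
"cell_type": "markdown",
"id": "sketch-redirect-partition",
"metadata": {},
"source": [
"The set logic behind the commands above can be sketched in pandas on a toy table. This is only an illustration: the rows and the `redirects` set are made-up values, and the actual split is done with `kgtk ifexists` / `kgtk ifnotexists` on the full files.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy removed statements (hypothetical values, for illustration only)\n",
"removed = pd.DataFrame({\n",
"    'id': ['s1', 's2', 's3', 's4'],\n",
"    'node1': ['Q1', 'Q2', 'Q3', 'Q4'],\n",
"    'label': ['P31', 'P31', 'P17', 'P279'],\n",
"    'node2': ['Q10', 'Q1', 'Q30', 'Q40'],\n",
"})\n",
"\n",
"# Toy redirect list: node IDs that have been redirected\n",
"redirects = {'Q1', 'Q3'}\n",
"\n",
"# A statement is redirected if node1 or node2 is a redirected node\n",
"is_redirected = removed['node1'].isin(redirects) | removed['node2'].isin(redirects)\n",
"redirected = removed[is_redirected]\n",
"non_redirected = removed[~is_redirected]\n",
"```\n",
"\n",
"Every statement lands in exactly one of the two parts, which is the property the `node1or2` and `both_nonredirects` files are meant to have."
]
},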
{
"cell_type": "markdown",
"id": "committed-volunteer",
"metadata": {},
"source": [
"## P31 edges distribution"
]
},
{
"cell_type": "markdown",
"id": "objective-range",
"metadata": {},
"source": [
"Next, for the redirected dataset (`../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`), we determine how many of the removed statements are P31 edges and compute further statistics on them."
]
},
{
"cell_type": "markdown",
"id": "final-fraud",
"metadata": {},
"source": [
"### For redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "analyzed-silicon",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'b, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_redirects_P31_stats1.tsv"
]
},
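{
"cell_type": "markdown",
"id": "sketch-p31-distinct-count",
"metadata": {},
"source": [
"As a rough pandas equivalent of the query above, the sketch below counts, for each P31 target, the number of distinct node1 values and derives the share of each target. The edge table is a made-up toy example, and the column names `count` and `perc` simply mirror the ones used in the next cell.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy edge table (hypothetical values, for illustration only)\n",
"edges = pd.DataFrame({\n",
"    'node1': ['Q1', 'Q1', 'Q2', 'Q3'],\n",
"    'label': ['P31', 'P31', 'P31', 'P17'],\n",
"    'node2': ['Q5', 'Q5', 'Q5', 'Q30'],\n",
"})\n",
"\n",
"# Keep only P31 edges, then count distinct node1 per node2\n",
"p31 = edges[edges['label'] == 'P31']\n",
"stats = p31.groupby('node2')['node1'].nunique().rename('count').to_frame()\n",
"\n",
"# Share of each target among all counted nodes\n",
"stats['perc'] = stats['count'] / stats['count'].sum()\n",
"```\n",
"\n",
"Note that `nunique` counts Q1 once even though it has two P31 edges to Q5, which matches the `count(distinct a)` in the query."
]
},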
{
"cell_type": "code",
"execution_count": 7,
"id": "smaller-eugene",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count perc\n",
"parent \n",
"Q4167836 526207 0.213808\n",
"Q17329259 301359 0.122448\n",
"Q5 222809 0.090531\n",
"Q4167410 108583 0.044119\n",
"Q13442814 101156 0.041102\n",
"Q7187 88231 0.035850\n",
"Q11266439 61007 0.024788\n",
"Q4423781 53671 0.021808\n",
"Q17143521 51581 0.020958\n",
"Q15917122 50642 0.020577\n",
"Q486972 49257 0.020014\n",
"Q16521 46522 0.018903\n",
"Q318 26722 0.010858\n",
"Q532 23721 0.009638\n",
"Q20900710 23482 0.009541"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_P31_stats1.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "suburban-cosmetic",
"metadata": {},
"source": [
"### For non-redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "characteristic-still",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'b, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "subsequent-dutch",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count perc\n",
"parent \n",
"Q4167836 368888 0.102453\n",
"Q4167410 132403 0.036773\n",
"Q5 130252 0.036176\n",
"Q571 126883 0.035240\n",
"Q11266439 125824 0.034946\n",
"Q838948 119928 0.033308\n",
"Q486972 108105 0.030025\n",
"Q532 106786 0.029658\n",
"Q783794 101121 0.028085\n",
"Q1539532 78186 0.021715\n",
"Q916333 62789 0.017439\n",
"Q16521 53402 0.014832\n",
"Q7366 45005 0.012499\n",
"Q13406463 42582 0.011827\n",
"Q18593264 40505 0.011250"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "whole-influence",
"metadata": {},
"source": [
"## Properties Distribution"
]
},
{
"cell_type": "markdown",
"id": "international-conditioning",
"metadata": {},
"source": [
"### For redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "known-moore",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" --match 'o: (a)-[r]->(b)' \\\n",
" --return 'r.label, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_redirects_props_dist.tsv"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "unlikely-default",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count perc\n",
"parent \n",
"P31 2381072 0.234921\n",
"P17 357286 0.035251\n",
"P1433 299464 0.029546\n",
"P735 295778 0.029182\n",
"P50 268412 0.026482\n",
"P2860 243607 0.024035\n",
"P625 227779 0.022473\n",
"P106 185184 0.018271\n",
"P131 183759 0.018130\n",
"P21 179069 0.017667\n",
"P921 167723 0.016548\n",
"P279 162394 0.016022\n",
"P1566 160213 0.015807\n",
"P684 152695 0.015065\n",
"P703 119182 0.011759"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_props_dist.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "satisfactory-future",
"metadata": {},
"source": [
"### For non-redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "seasonal-composite",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" --match 'o: (a)-[r]->(b)' \\\n",
" --return 'r.label, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_props_dist.tsv"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "straight-haiti",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" count perc\n",
"parent \n",
"P2093 6173393 0.161314\n",
"P1476 4238487 0.110754\n",
"P31 3327644 0.086953\n",
"P569 2011539 0.052563\n",
"P625 1494410 0.039050\n",
"P577 1116328 0.029170\n",
"P234 999522 0.026118\n",
"P570 983201 0.025692\n",
"P131 927413 0.024234\n",
"P364 870224 0.022739\n",
"P2044 780870 0.020405\n",
"P279 765112 0.019993\n",
"P969 732461 0.019140\n",
"P356 413439 0.010803\n",
"P637 387091 0.010115"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_props_dist.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "martial-friday",
"metadata": {},
"source": [
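"# Comparison of the removed non-redirected (NR) dataset: Qnodes and literals\n",
"\n",
"First, let's split this dataset by the type of node2: Qnode, date, number, or string. Dates, numbers, and strings are also written to a combined literals file."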
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "higher-photograph",
"metadata": {},
"outputs": [],
"source": [
"# Split the non-redirected removed statements by the type of node2.\n",
"\n",
"def is_num(string):\n",
"    # True if the string can be parsed as a float\n",
"    try:\n",
"        float(string)\n",
"        return True\n",
"    except ValueError:\n",
"        return False\n",
"\n",
"base = \"../../opAnalysis/removed_statements_both_nonredirects\"\n",
"\n",
"# Stream the input line by line instead of loading the whole file into memory\n",
"with open(base + \".tsv\", \"r\") as f1, \\\n",
"     open(base + \"_node2_string.tsv\", \"w\") as fStr, \\\n",
"     open(base + \"_node2_date.tsv\", \"w\") as fDat, \\\n",
"     open(base + \"_node2_qnode.tsv\", \"w\") as fQnd, \\\n",
"     open(base + \"_node2_num.tsv\", \"w\") as fNum, \\\n",
"     open(base + \"_node2_lit.tsv\", \"w\") as fnonQnd:\n",
"\n",
"    # Copy the header to every output file\n",
"    header = next(f1).rstrip(\"\\n\")\n",
"    for out in (fStr, fDat, fQnd, fNum, fnonQnd):\n",
"        out.write(header + \"\\n\")\n",
"\n",
"    for line in f1:\n",
"        line = line.rstrip(\"\\n\")\n",
"        if not line:\n",
"            continue  # skip empty lines, e.g. after a trailing newline\n",
"        val1 = line.split(\"\\t\")[3]  # fourth column (node2)\n",
"        if val1.startswith('Q'):\n",
"            fQnd.write(line + \"\\n\")\n",
"        # elif bool(re.search(\"\\^\\d{11}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z\\/\\d{,2}\",val1)):\n",
"        elif val1.startswith(\"^\"):\n",
"            fDat.write(line + \"\\n\")\n",
"            fnonQnd.write(line + \"\\n\")\n",
"        elif is_num(val1):\n",
"            fNum.write(line + \"\\n\")\n",
"            fnonQnd.write(line + \"\\n\")\n",
"        else:\n",
"            fStr.write(line + \"\\n\")\n",
"            fnonQnd.write(line + \"\\n\")"
]
},
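{
"cell_type": "markdown",
"id": "sketch-node2-classifier",
"metadata": {},
"source": [
"The routing rule used in the split can be isolated in a small function and tried on a few sample values. This is a minimal sketch: the name `classify_node2` and the sample values are hypothetical and not part of the pipeline.\n",
"\n",
"```python\n",
"def classify_node2(value):\n",
"    # Qnodes start with 'Q'\n",
"    if value.startswith('Q'):\n",
"        return 'qnode'\n",
"    # Values starting with '^' are treated as dates\n",
"    if value.startswith('^'):\n",
"        return 'date'\n",
"    # Anything that parses as a float is a number\n",
"    try:\n",
"        float(value)\n",
"        return 'num'\n",
"    except ValueError:\n",
"        # Everything else is treated as a string\n",
"        return 'string'\n",
"\n",
"examples = ['Q42', '^2020-12-07T00:00:00Z/11', '3.14', 'some text']\n",
"kinds = [classify_node2(v) for v in examples]\n",
"```\n",
"\n",
"The order of the checks matters: the Qnode and date prefixes are tested before the numeric check, as in the loop above."
]
},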
{
"cell_type": "markdown",
"id": "rough-emerald",
"metadata": {},
"source": [
"### String Comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "amateur-effort",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n",
" ../gdrive-kgtk-dump-2020-12-07/claims.string.tsv.gz \\\n",
" --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n",
" --graph-cache ~/temp2.sqlite3.db"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "separate-georgia",
"metadata": {},
"outputs": [],
"source": [
"!sed -i '1s/.*/node1\\tlabel\\tnode2\\tnode2;newLabl\\tnode2;nw/' ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv"
]
},
{
"cell_type": "markdown",
"id": "disturbed-geology",
"metadata": {},
"source": [
"The string subset has a branching factor of approximately 10, i.e. one removed statement with a string literal has been replaced by around 10 new statements (with the same node1-label combination). Comparing every such pair won't give us much insight. Instead, let's truncate this dataset, keeping for each node1-label combination a single new value together with the number of new values (the branching factor)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "fancy-photographer",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14091663 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "downtown-alabama",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n",
" --match \"(node1)-[r]->(node2{newLabl: newLabel, nw: newValue})\" \\\n",
" --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv \\\n",
" --graph-cache ~/sqlite3_caches/temptrunc.sqlite3.db"
]
},
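{
"cell_type": "markdown",
"id": "sketch-branching-truncation",
"metadata": {},
"source": [
"The truncation performed by the query above can be sketched in pandas as a group-by that keeps one new value (the maximum) and the number of new values per removed statement. The table and the column names `new_value` and `branching` are made up for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy join result: each removed statement paired with its new values\n",
"joined = pd.DataFrame({\n",
"    'node1': ['Q1', 'Q1', 'Q1', 'Q2'],\n",
"    'label': ['P1476', 'P1476', 'P1476', 'P2093'],\n",
"    'node2': ['old title', 'old title', 'old title', 'A. Author'],\n",
"    'new_value': ['New Title', 'Another Title', 'Title 3', 'Ann Author'],\n",
"})\n",
"\n",
"# One row per removed statement: a representative new value and the branching factor\n",
"truncated = (\n",
"    joined.groupby(['node1', 'label', 'node2'], as_index=False)\n",
"          .agg(new_value=('new_value', 'max'), branching=('new_value', 'count'))\n",
")\n",
"```\n",
"\n",
"Here the three rows for Q1 collapse into one row with a branching factor of 3, which is why the truncated file is so much smaller than the join result."
]
},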
{
"cell_type": "markdown",
"id": "tropical-cooperation",
"metadata": {},
"source": [
"Next, we compute the statistics and comparisons on this truncated dataset. Note: the original string-literal subset of removed statements was around 9 GB. The join with the claims file increased it to 90 GB, and the truncation has now reduced it to 778 MB."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "successful-singer",
"metadata": {},
"outputs": [],
"source": [
"from dateutil.parser import parse\n",
"import re\n",
"import rltk\n",
"from rltk.similarity import levenshtein_distance as ld\n",
"from nltk.tokenize import word_tokenize as wt\n",
"from tqdm.notebook import tqdm\n",
"\n",
"with open(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv\",\"r\") as f1, \\\n",
"     open(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\",\"w\") as fStr:\n",
"\n",
"    # Append one column per measure to the header\n",
"    firstLine = next(f1).rstrip()\n",
"    fStr.write(firstLine+\"\\tVersionBool\\tRangeBool\\tLevDist\\tRearranged\\tRearrangedFirstNP\\n\")\n",
"\n",
"    for line in tqdm(f1):\n",
"        line = line.rstrip(\"\\n\")  # keep the appended columns on the same row\n",
"        fields = line.split(\"\\t\")\n",
"        val1 = fields[2]        # removed value (node2)\n",
"        val2 = fields[4][1:-1]  # new value without its first and last character\n",
"        # Version-like string: digits and dots, optionally followed by words\n",
"        versionBool = bool(re.fullmatch(r\"[\\d\\.]+[\\w\\s\\d]*\",val1))\n",
"        # Numeric range such as 1990-1995\n",
"        rangeBool = bool(re.fullmatch(r\"[\\d]+[-|–][\\d]+\",val1))\n",
"        LevDist = ld(val1,val2)\n",
"        # Same set of tokens, i.e. only the word order changed\n",
"        rearranged = set(wt(val1)) == set(wt(val2))\n",
"        rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n",
"        fStr.write(line+ \"\\t\" + str(versionBool) + \"\\t\" + str(rangeBool) + \"\\t\" + \\\n",
"                   str(LevDist) + \"\\t\" + str(rearranged) + \"\\t\" + str(rearrangedFirstNP) + \"\\n\")"
]
},
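{
"cell_type": "markdown",
"id": "sketch-string-measures",
"metadata": {},
"source": [
"To make two of the measures above concrete, here is a dependency-free sketch of the edit distance and the rearrangement check. It is a simplified stand-in, not the code used for the results: `levenshtein` is a plain dynamic-programming implementation rather than the `rltk` function, and `is_rearranged` splits on whitespace instead of using the `nltk` tokenizer. Both function names are hypothetical.\n",
"\n",
"```python\n",
"def levenshtein(a, b):\n",
"    # Classic dynamic-programming edit distance (insert, delete, substitute)\n",
"    prev = list(range(len(b) + 1))\n",
"    for i, ca in enumerate(a, start=1):\n",
"        curr = [i]\n",
"        for j, cb in enumerate(b, start=1):\n",
"            cost = 0 if ca == cb else 1\n",
"            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))\n",
"        prev = curr\n",
"    return prev[-1]\n",
"\n",
"def is_rearranged(a, b):\n",
"    # Same set of whitespace-separated tokens, in any order\n",
"    return set(a.split()) == set(b.split())\n",
"```\n",
"\n",
"For example, `is_rearranged('John Smith', 'Smith John')` holds because both strings contain the same two tokens, while the edit distance between them is still greater than zero."
]
},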
{
"cell_type": "code",
"execution_count": 10,
"id": "international-violation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"list index out of range\r\n"
]
}
],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys label node1 \\\n",
" --filter-keys label node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "tracked-carroll",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1923347844 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "vocational-pound",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14091661 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "trained-tuning",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16922586 ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "daily-complexity",
"metadata": {},
"outputs": [],
"source": [
"str_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "otherwise-bones",
"metadata": {},
"outputs": [],
"source": [
"str_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "restricted-locking",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"136.48837958054622"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df['node2;branching'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "hundred-entrepreneur",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 2084163\n",
"2 1739757\n",
"3 1645943\n",
"4 1530528\n",
"5 1209068\n",
" ... \n",
"12813 2\n",
"12840 1\n",
"13554 1\n",
"18192 1\n",
"29360 1\n",
"Name: node2;branching, Length: 2191, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df['node2;branching'].value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "secret-contest",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14091660"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df['node2;branching'].value_counts().sum()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "editorial-romance",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14091660 updates, 25167 correspond to changes due to version change with average branching factor: 5.597170898398697\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['VersionBool'].sum()} correspond to changes due to version change with average branching factor: {str_df[str_df['VersionBool'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "social-plenty",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 25167.000000\n",
"mean 4.792625\n",
"std 6.162759\n",
"min 0.000000\n",
"25% 1.000000\n",
"50% 2.000000\n",
"75% 5.000000\n",
"max 63.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['VersionBool'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "promising-hopkins",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14091660 updates, 321952 correspond to changes due to range change with average branching factor: 1.0656495378193023\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['RangeBool'].sum()} correspond to changes due to range change with average branching factor: {str_df[str_df['RangeBool'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "varied-reform",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"count 321952.000000\n",
"mean 2.343707\n",
"std 2.188651\n",
"min 0.000000\n",
"25% 1.000000\n",
"50% 2.000000\n",
"75% 3.000000\n",
"max 47.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['RangeBool'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "annoying-transaction",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14091660 updates, 229782 correspond to changes due to rearrangement with average branching factor: 3.5381753139932632\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['Rearranged'].sum()} correspond to changes due to rearrangement with average branching factor: {str_df[str_df['Rearranged'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "three-characteristic",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 229782.000000\n",
"mean 2.934938\n",
"std 1.989685\n",
"min 0.000000\n",
"25% 1.000000\n",
"50% 4.000000\n",
"75% 4.000000\n",
"max 56.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['Rearranged'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "military-coordinator",
"metadata": {},
"outputs": [],
"source": [
"str_df.LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "european-treat",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'count v/s Lev edit distances')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAaE0lEQVR4nO3df5zcVX3v8debhEDJYvgRu8YkkNCmai7RSrb8KLTuVtSASB73lvYmN0Wo0PRRGx/eKtUg3IhYW9GLFgGLuV4uV4hZkSKkNBJbZMu9F6GQKoRAgysEkwgJEggupIXUz/3je9Z8M53dmZ18d2dzfD8fj3lkvt9z9nw/c3bmPd85MztRRGBmZge+g9pdgJmZVcOBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhm+0nSZZJuStePkTQgacJ+jLdZ0unp+sckfbmqWi1vDnRrWTl49mOMxZK+OtbHHS0R8cOI6IiIfwOQ1Cfpwv0Y788jouHP7+9xLA8OdGu3dwNr212EWQ4c6JmQNFPSrZKelfScpGvS/oMkXSrpKUk7JH1F0pTU1i1pa8045Zf7l0m6Of3MTyRtlNSV2m4EjgH+Ji0xfKROTY9JOqu0PTHVd8JgbcA7gDslHSrpplT7C5IekNQ5wjk4SNJyST9I49ws6ajU9k1Jy2r6PyTpPw0x1smS7k21PCSpu9Q2W9I/pDn5O2BqqW2WpEi39VPAbwDXpDm6ZohjnZt+P89JuqSmrbycU3eOhjqOpKskbZH0oqT1kn6jZty6v9vUXvf+lNrel363z0taJ+nYtF+SPp/uZy9K2iDp+GF/aVatiGjbBbge2AE80kTfzwPfS5fHgRfaWft4ugATgIfSHE0GDgVOS23vA/qB44AO4FbgxtTWDWytGWszcHq6fhnwL8CZ6Rh/AdxXr+8Qda0AVpW23w08Vto+GfhOuv6HwN8Ah6VjzQdeM8S4dY8LfBC4D5gBHAJ8CVid2t4L/L9S37nAC8AhdcaZDjyXbvfgk85zwGtT+3eAz6Vj/CbwE+Cm1DYLCGBi2u4DLhxmjuYCA2mcQ9K4e2p+Bzc1mqN6xwF+DzgamAh8GHgGOLTR75bh708LKe5Pb0rjXgrcm9reBawHjgCU+kxr9+Pj5+nS3oMXd+ITaCLQa37uA8D17Z688XIBTgGeHQyRmra7gPeXtt8AvJoejN00DvS/L7XNBXbX6ztEXb+cwu6wtL0KWFFq/yTw39L19wH3Am9u4vbWPS7wGPD20va00m09HHgJODa1fWqo+xDwUdKTXmnfOuA8ilcle4DJpbav0nqgrwB6S9uTgVeoH+hDzlGj46Q+zwNvafS7bXB/+iZwQWn7IOBl4FjgtyhOtk4GDmr34+Ln8dLWJZeIuAfYWd4n6Zck3ZleIv4fSW+s86OLgdVjUuSBYSbwVETsqdP2euCp0vZTFAHX7HLGM6XrLwOHSprYzA9GRD9FyL5H0mHA2RThN+hM9q6f30gRmr2SfiTpM5IObrLGQccC30jLES+kY/8b0BkRPwH+FliU+i6meIIZapzfGRwnjXUaxRPE64HnI+KlUv+n6ozRrNcDWwY30rjPDdF3RHMk6aK0NLIr3YYplJaHGPp3O9z96VjgqtK87KQ4G58eEd8GrgGuBXZIWinpNcPdeKvWeFxDXwl8ICLmAxcBXyw3pvW62cC321DbeLUFOGaIoP0RxYNw0OAZ5naKM9bDBhtUfNTutSM4bjPfvbyaIjwXAo+mkEfS6ygC8p8AIuLViPhERMwFfh04i2KZZCS2AGdExBGly6ERsa1ci6RTKJYR7h5mnBtrxpkcEZ8GngaOlDS51P+YYWpqNEdPUwQoAOmJ7+i6Aw0/R/scJ62XfwT4XeDIiDgC2EURvo0Md3/aAvxhzdz8QkTcm2r8QnrszgV+BfjTJo5nFRlXgS6pg+KO+nVJ36NYA51W020RcEukj4UZAP9IEQyfljQ5vX
l2ampbDfxJeiOvA/hz4Gvp7OtxirOyd6czvUsp1nGbtZ1ibX44vcA7gT9i37PzM4A7I4rX7ZJ6JM1LTyovUiyV/HSYcQ9Ot3PwMhG4DvhU6U2610paWPqZtRRPbpdTzMFQ499E8ariXZImpPG7Jc2IiKeAB4FPSJok6TTgPcPU2WiObgHOknSapEmptrqPywZzVHucwymeuJ8FJkpaATR7tjzc/ek64GJJ/yHVNEXS76TrvybppHRfeolijX6436FVbFwFOkU9L0TEr5Yub6rpswgvt+wjPbm9h2LN+ofAVuA/p+brKV6q3wM8SfEg+0D6uV3A+4EvA9soHoT7fOqlgb8ALk0vvy8aoranKd5E/HXga6Wm2o8rvo4i3F6kWCr5h1T3UNYCu0uXy4CrgDXAtyT9hOIN0pNKtfwrxZvCp7Pvk0ttzVsoXlF8jCIQt1CcaQ4+Xv5LGncn8HHgK8PUeRVwTvpEyBfqHGsj8Mepnqcp1rmH+h0MN0e1x1kH3EnxpP0Uxe99y78bsY7h7k8R8Q3gCoplnxeBRyienKF4wvgf6TY8RbF09NlmjmnVUDpBal8B0izgjog4Pm3fC3w+Ir4uSRRvAD2U2t5IcSedHe0u3FqWzqafAY6LiBfbXY9ZLtp6hi5pNcXZ2xskbZV0AbAEuEDSQ8BGijOlQYsoPhHgMD+wHUXx6RaHuVmF2n6GbmZm1Rhva+hmZtaipj5PPBqmTp0as2bNaulnX3rpJSZPnty44zhwoNTqOqvlOqvlOvdav379jyOi/seL2/UXTfPnz49W3X333S3/7Fg7UGp1ndVyndVynXsBD8Z4/EtRMzOrjgPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDQNd0vXpv5R6pEG/X5O0R9I51ZVnZmbNauYM/QZgwXAd0td5XgF8q4KazMysBQ0DPer8r0J1fAD4a4r/H9TMzNqgqS/nqv2K25q26RTf5dxD8d3bd0TELUOMsxRYCtDZ2Tm/t7e3paJ37NzF9t312+ZNn9LSmKNlYGCAjo6OdpfRkOusluusluvcq6enZ31EdNVrq+K7XP4S+GhE/LT4+vKhRcRKiv9ijq6uruju7m7pgFevup0rN9QvffOS1sYcLX19fbR6O8eS66yW66yW62xOFYHeRfG/l0DxH9CeKWlPRNxWwdhmZtak/Q70iJg9eF3SDRRLLrft77hmZjYyDQM9/a9C3cBUSVsp/g/FgwEi4rpRrc7MzJrWMNAjYnGzg0XE+ftVjZmZtcx/KWpmlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlomGgS7pekk7JD0yRPsSSQ9L2iDpXklvqb5MMzNrpJkz9BuABcO0Pwm8LSLmAZ8EVlZQl5mZjdDERh0i4h5Js4Zpv7e0eR8wo4K6zMxshBQRjTsVgX5HRBzfoN9FwBsj4sIh2pcCSwE6Ozvn9/b2jrhggB07d7F9d/22edOntDTmaBkYGKCjo6PdZTTkOqvlOqvlOvfq6elZHxFd9doanqE3S1IPcAFw2lB9ImIlaUmmq6sruru7WzrW1atu58oN9UvfvKS1MUdLX18frd7OseQ6q+U6q+U6m1NJoEt6M/Bl4IyIeK6KMc3MbGT2+2OLko4BbgXOjYjH978kMzNrRcMzdEmrgW5gqqStwMeBgwEi4jpgBXA08EVJAHuGWt8xM7PR08ynXBY3aL8QqPsmqJmZjR3/paiZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpaJhoEu6XpJOyQ9MkS7JH1BUr+khyWdUH
2ZZmbWSDNn6DcAC4ZpPwOYky5Lgb/a/7LMzGykGgZ6RNwD7Bymy0LgK1G4DzhC0rSqCjQzs+YoIhp3kmYBd0TE8XXa7gA+HRH/N23fBXw0Ih6s03cpxVk8nZ2d83t7e1sqesfOXWzfXb9t3vQpLY05WgYGBujo6Gh3GQ25zmq5zmq5zr16enrWR0RXvbaJo3rkGhGxElgJ0NXVFd3d3S2Nc/Wq27lyQ/3SNy9pbczR0tfXR6u3cyy5zmq5zmq5zuZU8SmXbcDM0vaMtM/MzMZQFYG+Bnhv+rTLycCuiHi6gnHNzGwEGi65SFoNdANTJW0FPg4cDBAR1wFrgTOBfuBl4PdHq1gzMxtaw0CPiMUN2gP448oqMjOzlvgvRc3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMtFUoEtaIGmTpH5Jy+u0HyPpbknflfSwpDOrL9XMzIbTMNAlTQCuBc4A5gKLJc2t6XYpcHNEvBVYBHyx6kLNzGx4zZyhnwj0R8QTEfEK0AssrOkTwGvS9SnAj6or0czMmqGIGL6DdA6wICIuTNvnAidFxLJSn2nAt4AjgcnA6RGxvs5YS4GlAJ2dnfN7e3tbKnrHzl1s312/bd70KS2NOVoGBgbo6OhodxkNuc5quc5quc69enp61kdEV722iRUdYzFwQ0RcKekU4EZJx0fET8udImIlsBKgq6sruru7WzrY1atu58oN9UvfvKS1MUdLX18frd7OseQ6q+U6q+U6m9PMkss2YGZpe0baV3YBcDNARHwHOBSYWkWBZmbWnGYC/QFgjqTZkiZRvOm5pqbPD4G3A0h6E0WgP1tloWZmNryGgR4Re4BlwDrgMYpPs2yUdLmks1O3DwN/IOkhYDVwfjRanDczs0o1tYYeEWuBtTX7VpSuPwqcWm1pZmY2Ev5LUTOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBNNBbqkBZI2SeqXtHyIPr8r6VFJGyV9tdoyzcyskYmNOkiaAFwLvAPYCjwgaU1EPFrqMwe4GDg1Ip6X9IujVbCZmdXXzBn6iUB/RDwREa8AvcDCmj5/AFwbEc8DRMSOass0M7NGFBHDd5DOARZExIVp+1zgpIhYVupzG/A4cCowAbgsIu6sM9ZSYClAZ2fn/N7e3paK3rFzF9t312+bN31KS2OOloGBATo6OtpdRkOus1qus1quc6+enp71EdFVr63hkkuTJgJzgG5gBnCPpHkR8UK5U0SsBFYCdHV1RXd3d0sHu3rV7Vy5oX7pm5e0NuZo6evro9XbOZZcZ7VcZ7VcZ3OaWXLZBswsbc9I+8q2Amsi4tWIeJLibH1ONSWamVkzmgn0B4A5kmZLmgQsAtbU9LmN4uwcSVOBXwGeqK5MMzNrpGGgR8QeYBmwDngMuDkiNkq6XNLZqds64DlJjwJ3A38aEc+NVtFmZvbvNbWGHhFrgbU1+1aUrgfwoXQxM7M28F+KmpllwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5lloqlAl7RA0iZJ/ZKWD9PvtyWFpK7qSjQzs2Y0DHRJE4BrgTOAucBiSXPr9Dsc+CBwf9VFmplZY82coZ8I9EfEExHxCtALLKzT75PAFcC/VFifmZk1SRExfAfpHGBBRFyYts8FToqIZaU+JwCXRMRvS+oDLoqIB+uMtRRYCt
DZ2Tm/t7e3paJ37NzF9t312+ZNn9LSmKNlYGCAjo6OdpfRkOusluusluvcq6enZ31E1F3Wnri/g0s6CPgccH6jvhGxElgJ0NXVFd3d3S0d8+pVt3Plhvqlb17S2pijpa+vj1Zv51hyndVyndVync1pZsllGzCztD0j7Rt0OHA80CdpM3AysMZvjJqZja1mAv0BYI6k2ZImAYuANYONEbErIqZGxKyImAXcB5xdb8nFzMxGT8NAj4g9wDJgHfAYcHNEbJR0uaSzR7tAMzNrTlNr6BGxFlhbs2/FEH27978sMzMbKf+lqJlZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlommAl3SAkmbJPVLWl6n/UOSHpX0sKS7JB1bfalmZjachoEuaQJwLXAGMBdYLGluTbfvAl0R8WbgFuAzVRdqZmbDa+YM/USgPyKeiIhXgF5gYblDRNwdES+nzfuAGdWWaWZmjSgihu8gnQMsiIgL0/a5wEkRsWyI/tcAz0TEn9VpWwosBejs7Jzf29vbUtE7du5i++76bfOmT2lpzNEyMDBAR0dHu8toyHVWy3VWy3Xu1dPTsz4iuuq1TazyQJJ+D+gC3lavPSJWAisBurq6oru7u6XjXL3qdq7cUL/0zUtaG3O09PX10ertHEuus1qus1qusznNBPo2YGZpe0batw9JpwOXAG+LiH+tpjwzM2tWM2voDwBzJM2WNAlYBKwpd5D0VuBLwNkRsaP6Ms3MrJGGgR4Re4BlwDrgMeDmiNgo6XJJZ6dunwU6gK9L+p6kNUMMZ2Zmo6SpNfSIWAusrdm3onT99IrrMjOzEfJfipqZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2SltdpP0TS11L7/ZJmVV6pmZkNq2GgS5oAXAucAcwFFkuaW9PtAuD5iPhl4PPAFVUXamZmw5vYRJ8Tgf6IeAJAUi+wEHi01GchcFm6fgtwjSRFRFRYa1NmLf/buvs3f/rdY1yJmdnYaibQpwNbSttbgZOG6hMReyTtAo4GflzuJGkpsDRtDkja1ErRwNTasRtR+14zjLjWNnGd1XKd1XKdex07VEMzgV6ZiFgJrNzfcSQ9GBFdFZQ06g6UWl1ntVxntVxnc5p5U3QbMLO0PSPtq9tH0kRgCvBcFQWamVlzmgn0B4A5kmZLmgQsAtbU9FkDnJeunwN8ux3r52ZmP88aLrmkNfFlwDpgAnB9RGyUdDnwYESsAf4ncKOkfmAnReiPpv1ethlDB0qtrrNarrNarrMJ8om0mVke/JeiZmaZcKCbmWXigAv0Rl9DMMa1zJR0t6RHJW2U9MG0/yhJfyfp++nfI9N+SfpCqv1hSSeMcb0TJH1X0h1pe3b6qob+9NUNk9L+tn2Vg6QjJN0i6Z8lPSbplPE4n5L+JP3OH5G0WtKh42U+JV0vaYekR0r7RjyHks5L/b8v6bx6xxqFOj+bfvcPS/qGpCNKbRenOjdJeldp/6hmQr06S20flhSSpqbtts0nABFxwFwo3pT9AXAcMAl4CJjbxnqmASek64cDj1N8PcJngOVp/3LginT9TOCbgICTgfvHuN4PAV8F7kjbNwOL0vXrgD9K198PXJeuLwK+NoY1/m/gwnR9EnDEeJtPij+kexL4hdI8nj9e5hP4TeAE4JHSvhHNIXAU8ET698h0/cgxqPOdwMR0/YpSnXPT4/0QYHbKgQljkQn16kz7Z1J8WOQpYGq75zMiDrhAPwVYV9q+GLi43XWV6rkdeAewCZiW9k0DNqXrXw
IWl/r/rN8Y1DYDuAv4LeCOdIf7cenB87O5TXfSU9L1iamfxqDGKSkoVbN/XM0ne/8y+qg0P3cA7xpP8wnMqgnKEc0hsBj4Umn/Pv1Gq86atv8IrErX93msD87pWGVCvTopvubkLcBm9gZ6W+fzQFtyqfc1BNPbVMs+0svotwL3A50R8XRqegboTNfbWf9fAh8Bfpq2jwZeiIg9dWrZ56scgMGvchhts4Fngf+Vloa+LGky42w+I2Ib8N+BHwJPU8zPesbffJaNdA7Hw2PtfRRnuwxTT1vqlLQQ2BYRD9U0tbXOAy3QxyVJHcBfA/81Il4st0XxdNzWz4ZKOgvYERHr21lHEyZSvLT9q4h4K/ASxfLAz4yT+TyS4gvpZgOvByYDC9pZ00iMhzlsRNIlwB5gVbtrqSXpMOBjwIp211LrQAv0Zr6GYExJOpgizFdFxK1p93ZJ01L7NGBH2t+u+k8Fzpa0GeilWHa5CjhCxVc11NbSrq9y2ApsjYj70/YtFAE/3ubzdODJiHg2Il4FbqWY4/E2n2UjncO2PdYknQ+cBSxJTz4MU0876vwliifzh9JjagbwT5Je1+46D7RAb+ZrCMaMJFH8lexjEfG5UlP5qxDOo1hbH9z/3vRO+MnArtLL4FETERdHxIyImEUxZ9+OiCXA3RRf1VCvzjH/KoeIeAbYIukNadfbKb6meVzNJ8VSy8mSDkv3gcE6x9V81hjpHK4D3inpyPSK5J1p36iStIBiafDsiHi5pv5F6RNDs4E5wD/ShkyIiA0R8YsRMSs9prZSfDjiGdo9n1Uvyo/2heJd5Mcp3tm+pM21nEbx0vVh4HvpcibF+uhdwPeBvweOSv1F8Z+F/ADYAHS1oeZu9n7K5TiKB0U/8HXgkLT/0LTdn9qPG8P6fhV4MM3pbRSfCBh38wl8Avhn4BHgRopPX4yL+QRWU6ztv0oRNhe0MocUa9j96fL7Y1RnP8Va8+Dj6bpS/0tSnZuAM0r7RzUT6tVZ076ZvW+Ktm0+I8J/+m9mlosDbcnFzMyG4EA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBP/HxyZ/1kdA/yIAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"str_df.LevDist.hist(bins=50).set_title(\"count v/s Lev edit distances\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "quarterly-shock",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'count v/s Lev edit distances till 20')"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAc/klEQVR4nO3dfZRcVZnv8e/PhJeZNAYwTosJEND4EkEZ0gIKavcSMQRN5s4wXiIiDGCGq3HpnQGNg4PIOAq6wKWIMpFhRSGkeRleciEauDP0cB2MQ6JACBkwIEhiSJRAxwYUgs/94+zOFJV663rPye+zVq0+p/beZz+9z6mnTu2qU6WIwMzMdn6v6HQAZmbWHE7oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07otsuSdL6kq9PyAZJGJI1rYHuPSTo2Lf+dpCuaFWs3S+N2cFpeJOlLablf0vrORrdrcULfBRUmnga2MVfSNe3ut1Ui4pcR0RMRLwFIGpJ0ZgPb+3JEVG3faD/tVireNG6PjnE7fyJpiaRfSRqW9B+Sjiyq82FJj0t6VtLNkvZtxv+QZ07oVq8TgGWdDsJ2Wj3APcAMYF/ge8BtknoAJL0F+CfgFKAXeA74dmdC3YlEhG8dvAH7AzcCvwaeAr6V7n8F8HngcWAz8H1gYirrB9YXbecx4Ni0fD5wXWrzW2AN0JfKrgL+ADwPjACfKRHTWuADBevjU3yHF8S2CZgE7AlcnWJ/huxB2lvmf90eY9H9rwAWAI+k7VwH7JvKfgDML6p/H/DnZfo4Crg7xXIf0F9QdhDw72lM7gC+BVydyqYCkf7XfwReAn6XxuhbZfo6Je2fp4BzS+yD0W2XHKNy/QDfAJ4AtgKrgHcV9Fl231Y6nlLZ6WnfPg0sBw5M9wv4OtlxthVYDRxS4v8tF28Ar0/Li4AvlTtOqzwWtgIz0vKXgWsKyl4HvADs1enHbDffOts5XJkOogdqrP8h4MF0EF/Tytja9P+PS0nn68CE9MA/JpWdDqwDDiY7m7kRuCqV7fBAKZFMfgfMSn18BVhRqm6ZuM4DFhesnwCsLVg/CvhxWv5r4P8Af5z6mgG8ssx2S/YLfApYAUwB9iA7M1uSyj4K/EdB3elkSXGPEtuZnJLYLLInifel9Ven8h8Dl6Q+3k2WEHdI6Gl9CDizwhhNT0nt3Wl7lwDbKJ3Qy45RqX6AjwCvInty+VvgSWDPavuWysfTHLLj6c1pu58H7k5l7yd74tibLLm/GdivzP9dKt6GEzpwWPq/Jqb1W4DPFtUZISV830rfOj3lsgiYWUtFSdOAzwFHR8RbgE+3Lqy2OQJ4LXBORDwbEb+LiB+lspOBSyLi0YgYIfvfT5I0vsZt/ygilkU2J3wV8LYxxHUNMFvSH6f1DwNLCsoLp1teJEs+r4+IlyJiVURsHUNfAGcB50bE+oj4PVnSOjH9rzcBh0k6MNU9Gbgx1Sv2EWBZ+r//EBF3ACuBWZIOAN4O/H1E/D4i7iJLsvU6Ebg1Iu5Ksfw92SufUsY0RhFxdUQ8FRHbIuJisieMNxZUKbdvKx1PZwFfiYi1EbGN7Ax4dFxfBPYC3gQo1dk49iGpj6RXpv/jixExnO7uAYaLqg6nOK2Mjib09KDaUnifpNdJ+qGkVZL+n6Q3paKPAZdFxNOp7eY2h9sK+wOPpwdYsdeSvZwf9TjZmVVvjdt+smD5OWDPWp8MImId2UvzD6akPpssyY+axX8n9KvIXr4Ppje4vipptxpjHHUgcJOkZyQ9k/p+iWzq5rfAbcBJqe5cYHGF7fzl6HbSto4B9iMbz6cj4tmC+o+X2EatXks2LQJA2u5TZeqOaYwknS1pbXqz8BlgItn01qhy+7bS8XQg8I2CcdlCdjY+OSL+jWz66TJgs6SFKcm2nKQ/IntiXRERXykoGgGKY3gl2asqK6PTZ+ilLAQ+GREzgLP57zdC3gC8Ib0bvkJSTW
f2Xe4J4IAyifZXZA/CUQeQvaTfBDxL9vIdgPRRu1ePod9avjN5CVnynAM8mJI8kl5DliB/ChARL0bEFyNiOvBO4ANk0yRj8QRwfETsXXDbMyI2FMYi6R1k0wh3VtjOVUXbmRARFwIbgX0kTSiof0CFmKqN0UayBApAeuJ7VckNVR6jl/Uj6V3AZ8imF/eJiL3JzkxVJR6ofDw9Afx10dj8UUTcnWL8ZnrMTSd7rJ1Tpo+mfd+2pD2Am4H1ZNNShdZQ8KoyfSxyD+DhZvWfR12V0NM73O8Erpd0L9lc6n6peDwwjWxebi7wXUl7tz/KpvpPssRwoaQJkvaUdHQqWwL8b0kHpXH5MnBtOvt6mOys7IR0pvd5soO9VpvI5uYrGQSOA/4XLz87Px74YUQ2qSlpQNKh6UllK9nL93JTDwC7pf9z9DYeuBz4x9FpFUmvljSnoM0ysie3C8jGoNz2ryZ7VfF+SePS9vslTYmIx8mmX74oaXdJxwAfrBBntTG6AfiApGMk7Z5iK/l4qjJGxf3sRfbE/WtgvKTz2PFMtZxKx9PlwOfSp0eQNFHSX6blt0s6Mh1Lz5LNZZcb41qOnapSXzeQvTl/aol9uphsX74rPQlfQDbV5jP0CroqoZPF80xEHFZwe3MqWw8sTWc7vyBLatM6FmkTpDnQDwKvB35J9j/+z1R8JdlL9buAX5A9yD6Z2g0DHweuADaQPQjHcgHHV4DPp5ffZ5eJbSPZm4jvBK4tKCr+uOJryB6YW8mmSv49xV3OMrIH8ejtfLJPdSwFbpf0W7I3SLd/JjnNUd8IHMvLn1yKY36C7BXF35ElxCfIzjRHj/MPp+1uAb5A9kmRcr5BNo//tKRvluhrDfCJFM9Gsk+OlNsHlcaouJ/lwA/Jju/Hyfb7EztssYRKx1NE3ARcRDbtsxV4gOzJGbInjO+m/2H0UztfK9NNxXEZg9FXKscBzyi7OGkkvUIZHd+zyBL7ZrInuo830N8uQelEq3MBSFPJ3lw6JK3fDXw9Iq6XJOCtEXFfmmKZGxGnSpoE/Aw4LCLKzVtak6Wz6SeBg+t449PMWqyjZ+iSlpCdBb5R0npJZ5B9iuEMSfeRzaONvvReDjwl6UGyOdRznMzbbl+yT4k4mZt1oY6foZuZWXN02xy6mZnVqdaLVJpu0qRJMXXq1LraPvvss0yYMKF6xTbr1rige2NzXGPjuMYmj3GtWrXqNxFR+mPKzbjctJ7bjBkzol533nln3W1bqVvjiuje2BzX2DiuscljXMDK6NJL/83MrEmc0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsIJ3cwsJzp26b+Z7Wjqgtu2Lz924QkdjMR2Rk7oZjnhJwPzlIuZWU44oZuZ5UTVhC7pSkmbJT1Qpd7bJW2TdGLzwjMzs1rVcoa+CJhZqUL6NfOLgNubEJOZmdWhakKPiLvIfiW9kk8C/0L269xmZtYBNf2mqKSpwK0RcUiJssnANcAAcGWqd0OZ7cwD5gH09vbOGBwcrCvokZERenp66mrbSt0aF3RvbI7r5VZvGN6+fOjkiTuUV4qrWttW8n4cm0biGhgYWBURfSULy/3yReENmAo8UKbseuCotLwIOLGWbfoXi9qrW2NzXC934Gdv3X4rpVJc1dq2kvfj2LTqF4ua8Tn0PmBQEsAkYJakbRFxcxO2bWZmNWo4oUfEQaPLkhaRTbnc3Oh2zcxsbKomdElLgH5gkqT1wBeA3QAi4vKWRmdmZjWrmtAjYm6tG4uI0xqKxszM6uYrRc3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsI/QWfWZP4pOOsUn6GbmeWEE7qZWU44oZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU74wiIze9nFUOALonZWPkM3M8sJJ3Qzs5xwQjczy4mqCV3SlZI2S3qgTPnJku6XtFrS3ZLe1vwwzcysml
rO0BcBMyuU/wJ4T0QcCvwDsLAJcZmZ2RhV/ZRLRNwlaWqF8rsLVlcAU5oQl5mZjZEionqlLKHfGhGHVKl3NvCmiDizTPk8YB5Ab2/vjMHBwTEHDDAyMkJPT09dbVupW+OC7o0tj3Gt3jC8ffnQyROb2rZSXM3qt572edyPrdRIXAMDA6sioq9UWdMSuqQB4NvAMRHxVLVt9vX1xcqVK6v2XcrQ0BD9/f11tW2lbo0Luje2PMbVyA9cVGtbKa5m9VtP+zzux1ZqJC5JZRN6Uy4skvRW4Arg+FqSuZmZNV/DH1uUdABwI3BKRDzceEhmnbd6wzBTF9y2w5mrWTereoYuaQnQD0yStB74ArAbQERcDpwHvAr4tiSAbeVeDpiZWevU8imXuVXKzwRKvglqZmbt4ytFzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McqIpX59r1m0a+W5ws52Vz9DNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHKiakKXdKWkzZIeKFMuSd+UtE7S/ZIOb36YZmZWTS1n6IuAmRXKjwempds84DuNh2VmZmNVNaFHxF3AlgpV5gDfj8wKYG9J+zUrQDMzq40ionolaSpwa0QcUqLsVuDCiPhRWv9X4LMRsbJE3XlkZ/H09vbOGBwcrCvokZERenp66mrbSt0aF3RvbK2Ka/WG4e3Lh06eOOb2m7cMs+n5+to30ne1tpXGq1n91tN+Vzu+GtVIXAMDA6sioq9UWVu/yyUiFgILAfr6+qK/v7+u7QwNDVFv21bq1rige2NrVVynFX6Xy8lj3/6li2/h4tXj62rfSN/V2lYar2b1W0/7Xe34alSr4mrGp1w2APsXrE9J95mZWRs1I6EvBT6aPu1yFDAcERubsF0zMxuDqlMukpYA/cAkSeuBLwC7AUTE5cAyYBawDngO+KtWBWtmZuVVTegRMbdKeQCfaFpEZmZWF18pamYNW71hmKkLbnvZD4tY+zmhm5nlhBO6mVlOOKGbmeWEfyTaupZ/6NlsbHyGbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTNSV0STMlPSRpnaQFJcoPkHSnpJ9Jul/SrOaHamZmlVT9gQtJ44DLgPcB64F7JC2NiAcLqn0euC4iviNpOrAMmNqCeG0n4x+pMGufWs7QjwDWRcSjEfECMAjMKaoTwCvT8kTgV80L0czMaqGIqFxBOhGYGRFnpvVTgCMjYn5Bnf2A24F9gAnAsRGxqsS25gHzAHp7e2cMDg7WFfTIyAg9PT11tW2lbo0LOhfb6g3D25cPnTxxh/JKcVVr20i/1WzeMsym59vfdzeMVz3tGxmvVurWx2QjcQ0MDKyKiL5SZc36TdG5wKKIuFjSO4CrJB0SEX8orBQRC4GFAH19fdHf319XZ0NDQ9TbtpW6NS7oXGynFU65nLxj/5Xiqta2kX6ruXTxLVy8enzb++6G8aqnfSPj1Urd+phsVVy1TLlsAPYvWJ+S7it0BnAdQET8GNgTmNSMAM3MrDa1JPR7gGmSDpK0O3ASsLSozi+B9wJIejNZQv91MwM1M7PKqib0iNgGzAeWA2vJPs2yRtIFkmanan8LfEzSfcAS4LSoNjlvZmZNVdMcekQsI/soYuF95xUsPwgc3dzQzMxsLHylqJlZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU406/vQzczq4p8pbB6foZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5URNCV3STEkPSVonaUGZOh+S9KCkNZKuaW6YZmZWTdULiySNAy4D3gesB+6RtDT9MPRonW
nA54CjI+JpSX/SqoDNzKy0Ws7QjwDWRcSjEfECMAjMKarzMeCyiHgaICI2NzdMMzOrRhFRuYJ0IjAzIs5M66cAR0bE/II6NwMPA0cD44DzI+KHJbY1D5gH0NvbO2NwcLCuoEdGRujp6amrbSt1a1zQudhWbxjevnzo5Ik7lFeKq1rbRvqtZvOWYTY93/6+u2G86mnfqfGqplsfk43ENTAwsCoi+kqVNeu7XMYD04B+YApwl6RDI+KZwkoRsRBYCNDX1xf9/f11dTY0NES9bVupW+OCzsV2WuH3dJy8Y/+V4qrWtpF+q7l08S1cvHp82/vuhvGqp32nxquabn1MtiquWqZcNgD7F6xPSfcVWg8sjYgXI+IXZGfr05oTopmZ1aKWhH4PME3SQZJ2B04ClhbVuZns7BxJk4A3AI82L0wzM6umakKPiG3AfGA5sBa4LiLWSLpA0uxUbTnwlKQHgTuBcyLiqVYFbWZmO6ppDj0ilgHLiu47r2A5gL9JNzMz6wBfKWpmlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhPN+sUiM7O2m1r4a0cXntDBSLqDz9DNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnakrokmZKekjSOkkLKtT7C0khqa95IZqZWS2qJnRJ44DLgOOB6cBcSdNL1NsL+BTwk2YHaWZm1dVyYdERwLqIeBRA0iAwB3iwqN4/ABcB5zQ1Qus4X7xhtnNQRFSuIJ0IzIyIM9P6KcCRETG/oM7hwLkR8ReShoCzI2JliW3NA+YB9Pb2zhgcHKwr6JGREXp6eupq20rdGhc0FtvqDcPblw+dPLGpbSvF1cp+q9m8ZZhNz7e/724Yr3ra74zj1UmNxDUwMLAqIkpOazd86b+kVwCXAKdVqxsRC4GFAH19fdHf319Xn0NDQ9TbtpW6NS5oLLbTCs/QTx7bNqq1rRRXK/ut5tLFt3Dx6vFt77sbxque9jvjeHVSq+Kq5U3RDcD+BetT0n2j9gIOAYYkPQYcBSz1G6NmZu1VS0K/B5gm6SBJuwMnAUtHCyNiOCImRcTUiJgKrABml5pyMTOz1qma0CNiGzAfWA6sBa6LiDWSLpA0u9UBmplZbWqaQ4+IZcCyovvOK1O3v/GwzMxsrHylqJlZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU74R6LNbJeUx+8o8hm6mVlOOKGbmeWEE7qZWU44oZuZ5YQTuplZTuySn3LJ47vbZmY+QzczywkndDOznNglp1x2Rp4mMrNqfIZuZpYTTuhmZjnhhG5mlhNO6GZmOVHTm6KSZgLfAMYBV0TEhUXlfwOcCWwDfg2cHhGPNzlWM7Ou0Y0fVKh6hi5pHHAZcDwwHZgraXpRtZ8BfRHxVuAG4KvNDtTMzCqrZcrlCGBdRDwaES8Ag8CcwgoRcWdEPJdWVwBTmhummZlVo4ioXEE6EZgZEWem9VOAIyNifpn63wKejIgvlSibB8wD6O3tnTE4OFhX0CMjI/T09NTVFmD1huHty4dOnlj3doo1GlcljcbcSGyN9F2tbaW4WtlvNZu3DLPp+fb33Q3jVU/7XW28Gm3fyONxYGBgVUT0lSpr6oVFkj4C9AHvKVUeEQuBhQB9fX3R399fVz9DQ0PU2xbgtMK5r5Pr306xRuOqpNGYL118Cxf/6Nms/Rjn+xrpu1rbSmPWyn6ruXTxLVy8enzb++6G8aqn/a42Xo22b1WuqCWhbwD2L1ifku57GUnHAucC74mI3zcnPDMzq1Utc+j3ANMkHSRpd+AkYGlhBUl/CvwTMDsiNjc/TDMzq6ZqQo+IbcB8YDmwFrguItZIukDS7FTta0APcL2keyUtLbM5MzNrkZrm0CNiGbCs6L7zCpaPbXJcZmY2Rr5S1MwsJ/z1uWZmbVB4ZemimRNa0ocT+hh14+W+ZmbgKR
czs9xwQjczywkndDOznHBCNzPLCb8p2kZ+Q9XMWsln6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmObFTJvTVG4aZuuC2l32u28xsV7dTJnQzM9uRE7qZWU44oZuZ5YQTuplZTtSU0CXNlPSQpHWSFpQo30PStan8J5KmNj1SMzOrqGpClzQOuAw4HpgOzJU0vajaGcDTEfF64OvARc0O1MzMKqvlDP0IYF1EPBoRLwCDwJyiOnOA76XlG4D3SlLzwjQzs2oUEZUrSCcCMyPizLR+CnBkRMwvqPNAqrM+rT+S6vymaFvzgHlp9Y3AQ3XGPQn4TdVa7detcUH3xua4xsZxjU0e4zowIl5dqqCtP3AREQuBhY1uR9LKiOhrQkhN1a1xQffG5rjGxnGNza4WVy1TLhuA/QvWp6T7StaRNB6YCDzVjADNzKw2tST0e4Bpkg6StDtwErC0qM5S4NS0fCLwb1FtLsfMzJqq6pRLRGyTNB9YDowDroyINZIuAFZGxFLgn4GrJK0DtpAl/VZqeNqmRbo1Luje2BzX2Diusdml4qr6pqiZme0cfKWomVlOOKGbmeVEVyf0bvzKAUn7S7pT0oOS1kj6VIk6/ZKGJd2bbue1Oq7U72OSVqc+V5Yol6RvpvG6X9LhbYjpjQXjcK+krZI+XVSnbeMl6UpJm9O1E6P37SvpDkk/T3/3KdP21FTn55JOLVWnyXF9TdJ/pX11k6S9y7StuN9bENf5kjYU7K9ZZdpWfPy2IK5rC2J6TNK9Zdq2ZLzK5Ya2Hl8R0ZU3sjdgHwEOBnYH7gOmF9X5OHB5Wj4JuLYNce0HHJ6W9wIeLhFXP3BrB8bsMWBShfJZwA8AAUcBP+nAPn2S7MKIjowX8G7gcOCBgvu+CixIywuAi0q02xd4NP3dJy3v0+K4jgPGp+WLSsVVy35vQVznA2fXsK8rPn6bHVdR+cXAee0cr3K5oZ3HVzefoXflVw5ExMaI+Gla/i2wFpjcyj6baA7w/cisAPaWtF8b+38v8EhEPN7GPl8mIu4i+yRWocLj6HvAn5Vo+n7gjojYEhFPA3cAM1sZV0TcHhHb0uoKsmtA2qrMeNWilsdvS+JKOeBDwJJm9VdjTOVyQ9uOr25O6JOBJwrW17Nj4txeJx34w8Cr2hIdkKZ4/hT4SYnid0i6T9IPJL2lTSEFcLukVcq+ZqFYLWPaSidR/kHWifEa1RsRG9Pyk0BviTqdHrvTyV5dlVJtv7fC/DQVdGWZKYROjte7gE0R8fMy5S0fr6Lc0Lbjq5sTeleT1AP8C/DpiNhaVPxTsmmFtwGXAje3KaxjIuJwsm/G/ISkd7ep36qUXZQ2G7i+RHGnxmsHkb3+7arP8ko6F9gGLC5Tpd37/TvA64DDgI1k0xvdZC6Vz85bOl6VckOrj69uTuhd+5UDknYj22GLI+LG4vKI2BoRI2l5GbCbpEmtjisiNqS/m4GbyF72FqplTFvleOCnEbGpuKBT41Vg0+jUU/q7uUSdjoydpNOADwAnp2Swgxr2e1NFxKaIeCki/gB8t0x/nRqv8cCfA9eWq9PK8SqTG9p2fHVzQu/KrxxI83P/DKyNiEvK1HnN6Fy+pCPIxrmlTzSSJkjaa3SZ7A21B4qqLQU+qsxRwHDBS8FWK3vW1InxKlJ4HJ0K3FKiznLgOEn7pCmG49J9LSNpJvAZYHZEPFemTi37vdlxFb7v8j/K9FfL47cVjgX+K9I3vxZr5XhVyA3tO76a/U5vk981nkX2TvEjwLnpvgvIDnCAPclewq8D/hM4uA0xHUP2kul+4N50mwWcBZyV6swH1pC9s78CeGcb4jo49Xdf6nt0vArjEtmPlTwCrAb62rQfJ5Al6IkF93VkvMieVDYCL5LNU55B9r7LvwI/B/4vsG+q2wdcUdD29HSsrQP+qg1xrSObVx09zkY/0fVaYFml/d7iuK5Kx8/9ZMlqv+K40voOj99WxpXuXzR6XB
XUbct4VcgNbTu+fOm/mVlOdPOUi5mZjYETuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5cT/B3ZNeWmC500qAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"str_df.LevDist[str_df.LevDist <= 20].hist(bins=100).set_title(\"count v/s Lev edit distances till 20\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "entire-candle",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"list index out of range\r\n"
]
}
],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 label \\\n",
" --filter-keys node1 label \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "similar-nevada",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16922584 ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv"
]
},
{
"cell_type": "markdown",
"id": "administrative-barbados",
"metadata": {},
"source": [
"### Dates Comparison"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "creative-office",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-03-15 01:44:30 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_22_c1.\"node1\", graph_22_c1.\"label\", graph_22_c1.\"node2\", graph_24_c2.\"label\" \"_aLias.newNode2Label\", graph_24_c2.\"node2\" \"_aLias.newNode2\"\n",
" FROM graph_22 AS graph_22_c1, graph_24 AS graph_24_c2\n",
" WHERE graph_22_c1.\"node1\"=graph_24_c2.\"node1\"\n",
" AND (graph_22_c1.\"label\" = graph_24_c2.\"label\")\n",
" PARAS: []\n",
"---------------------------------------------\n",
"[2021-03-15 01:44:30 sqlstore]: CREATE INDEX on table graph_22 column node1 ...\n",
"[2021-03-15 01:44:33 sqlstore]: ANALYZE INDEX on table graph_22 column node1 ...\n",
"[2021-03-15 01:44:34 sqlstore]: CREATE INDEX on table graph_24 column node1 ...\n",
"[2021-03-15 01:45:08 sqlstore]: ANALYZE INDEX on table graph_24 column node1 ...\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_date.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/claims.time.tsv.gz \\\n",
" --match \"node2: (x)-[r]->(y), time: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_date_new_vals_rightone.tsv\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "sophisticated-glance",
"metadata": {},
"outputs": [],
"source": [
"# from dateutil.parser import parse\n",
"# import re\n",
"# import rltk\n",
"# from rltk.similarity import levenshtein_distance as ld\n",
"# from nltk.tokenize import word_tokenize as wt\n",
"# from tqdm.notebook import tqdm\n",
"\n",
"# f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",\"r\").read().split('\\n')\n",
"# fStr = open(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date_measured.tsv\",\"w\")\n",
"\n",
"# firstLine = f1[0]\n",
"\n",
"# fStr.write(firstLine+\"\\tSameDate\\n\")\n",
"\n",
"# for i in tqdm(range(1, len(f1)-1)):\n",
"#     line = f1[i]\n",
"#     val1 = line.split(\"\\t\")[2]\n",
"#     val2 = line.split(\"\\t\")[4]\n",
"#     val2 = val2[1:-1]\n",
"#     versionBool = bool(re.fullmatch(\"[\\d\\.]+[\\w\\s\\d]*\",val1))\n",
"#     rangeBool = bool(re.fullmatch(\"[\\d]+[-|–][\\d]+\",val1))\n",
"#     LevDist = ld(val1,val2)\n",
"#     rearranged = set(wt(val1)) == set(wt(val2))\n",
"#     rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n",
"#     fStr.write(line+ \"\\t\" + str(versionBool) + \"\\t\" + str(rangeBool) + \"\\t\" + \\\n",
"#                str(LevDist) + \"\\t\" + str(rearranged) + \"\\t\" + str(rearrangedFirstNP) + \"\\n\")\n",
"\n",
"# fStr.close()"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "identified-calculation",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"date_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_date_new_vals_rightone.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "federal-cursor",
"metadata": {},
"outputs": [],
"source": [
"# date_df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "infinite-handbook",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Q1004723</td><td>P569</td><td>^00000001887-00-00T00:00:00Z/9</td><td>P569</td><td>^1887-01-01T00:00:00Z/9</td></tr>\n",
"    <tr><th>1</th><td>Q102084</td><td>P569</td><td>^00000001093-00-00T00:00:00Z/9</td><td>P569</td><td>^1093-01-01T00:00:00Z/9</td></tr>\n",
"    <tr><th>2</th><td>Q10272460</td><td>P570</td><td>^00000001917-00-00T00:00:00Z/9</td><td>P570</td><td>^1919-03-06T00:00:00Z/11</td></tr>\n",
"    <tr><th>3</th><td>Q10289892</td><td>P569</td><td>^00000001953-00-00T00:00:00Z/9</td><td>P569</td><td>^1953-01-01T00:00:00Z/9</td></tr>\n",
"    <tr><th>4</th><td>Q1029352</td><td>P569</td><td>^00000001893-00-00T00:00:00Z/9</td><td>P569</td><td>^1893-01-20T00:00:00Z/11</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node1 label node2 newNode2Label \\\n",
"0 Q1004723 P569 ^00000001887-00-00T00:00:00Z/9 P569 \n",
"1 Q102084 P569 ^00000001093-00-00T00:00:00Z/9 P569 \n",
"2 Q10272460 P570 ^00000001917-00-00T00:00:00Z/9 P570 \n",
"3 Q10289892 P569 ^00000001953-00-00T00:00:00Z/9 P569 \n",
"4 Q1029352 P569 ^00000001893-00-00T00:00:00Z/9 P569 \n",
"\n",
" newNode2 \n",
"0 ^1887-01-01T00:00:00Z/9 \n",
"1 ^1093-01-01T00:00:00Z/9 \n",
"2 ^1919-03-06T00:00:00Z/11 \n",
"3 ^1953-01-01T00:00:00Z/9 \n",
"4 ^1893-01-20T00:00:00Z/11 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "established-joining",
"metadata": {},
"outputs": [],
"source": [
"def parseDate(date_str):\n",
"    # Split a date such as '1887-00-00T00:00:00Z/9' into\n",
"    # [year, month, day, hour, minute, second, precision].\n",
"#     try:\n",
"    if date_str == '' or date_str == \" \": return []\n",
"    elems = []\n",
"    # Searching from index 1 skips a leading '-' sign on a negative year.\n",
"    toFetchI = 1\n",
"    dash1 = date_str.find(\"-\",toFetchI)\n",
"    toFetchI = dash1 + 1\n",
"    elems.append(int(date_str[:dash1]))\n",
"\n",
"    dash2 = date_str.find(\"-\",toFetchI)\n",
"    toFetchI = dash2 + 1\n",
"    elems.append(int(date_str[dash1+1:dash2]))\n",
"\n",
"    dashT = date_str.find(\"T\",toFetchI)\n",
"    toFetchI = dashT + 1\n",
"    elems.append(int(date_str[dash2+1:dashT]))\n",
"\n",
"    dashC = date_str.find(\":\",toFetchI)\n",
"    toFetchI = dashC + 1\n",
"    elems.append(int(date_str[dashT+1:dashC]))\n",
"\n",
"    dashC2 = date_str.find(\":\",toFetchI)\n",
"    toFetchI = dashC2 + 1\n",
"    elems.append(int(date_str[dashC+1:dashC2]))\n",
"\n",
"    dashZ = date_str.find(\"Z\",toFetchI)\n",
"    # Skip the 'Z/' separator; everything after it is the precision code.\n",
"    toFetchI = dashZ + 2\n",
"    elems.append(int(date_str[dashC2+1:dashZ]))\n",
"\n",
"    elems.append(int(date_str[toFetchI:]))\n",
"    return elems\n",
"#     except:\n",
"#         print(date_str)\n",
"#         return []"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "lucky-gossip",
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"def validateDate(elems):\n",
"    if len(elems) == 0:\n",
"        return False\n",
"    precision = elems[-1]\n",
"#     assert precision >= 9\n",
"    # Slicing makes a copy, so the caller's list is not modified below.\n",
"    elems = elems[:-1]\n",
"#     if precision == 14: #second\n",
"#         lastIndex = 6\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"#     elif precision == 13: #minute\n",
"#         lastIndex = 5\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"#     elif precision == 12: #hour\n",
"#         lastIndex = 4\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"#     elif precision == 11: #day\n",
"#         lastIndex = 3\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"#     elif precision == 10: #month\n",
"#         lastIndex = 2\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"#     elif precision <= 9: #year\n",
"#         lastIndex = 1\n",
"#         status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n",
"    # A month or day of 0 (left unspecified at coarser precisions) is checked as 1.\n",
"    if elems[1] == 0: elems[1] = 1\n",
"    if elems[2] == 0: elems[2] = 1\n",
"\n",
"    # Years outside 1970..9999 are swapped for a stand-in with the same leap-year\n",
"    # status (1972 if leap, 1970 otherwise), so Feb 29 is still checked correctly.\n",
"    if elems[0] < 1970 or elems[0] > 9999:\n",
"        if elems[0] % 400 == 0 or (elems[0] % 4 == 0 and elems[0] % 100 != 0):\n",
"            elems[0] = 1972\n",
"        else:\n",
"            elems[0] = 1970\n",
"    if precision < 0 or precision > 14:\n",
"        return False\n",
"    try:\n",
"        datetime.datetime(*elems)\n",
"        return True\n",
"    except (ValueError, OverflowError):\n",
"        return False"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "executed-theater",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"validateDate(parseDate(\"1887-00-00T00:00:00Z/9\"))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "enormous-carpet",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime.datetime(1948, 2, 29, 0, 0, 0, 11)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"datetime.datetime(*[1948, 2, 29, 0, 0, 0, 11])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "complete-index",
"metadata": {},
"outputs": [],
"source": [
"date_df['parsed_date'] = date_df['node2'].apply(lambda x: parseDate(x[1:]))\n",
"date_df['parsed_date2'] = date_df['newNode2'].apply(lambda x: parseDate(x[1:]))\n",
"date_df['valid_date'] = date_df['node2'].apply(lambda x: validateDate(parseDate(x[1:])))\n",
"date_df['same_date'] = date_df.apply(lambda p: p.parsed_date == p.parsed_date2, axis=1)\n",
"date_df['str_same_date'] = date_df.apply(lambda p: p.node2 == p.newNode2, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "surface-warehouse",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4711733"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(date_df)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "diagnostic-satellite",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th><th>parsed_date</th><th>parsed_date2</th><th>valid_date</th><th>same_date</th><th>str_same_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>5950</th><td>Q1285220</td><td>P570</td><td>^00000001979-02-29T00:00:00Z/11</td><td>P570</td><td>^1979-03-29T00:00:00Z/11</td><td>[1979, 2, 29, 0, 0, 0, 11]</td><td>[1979, 3, 29, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>5973</th><td>Q165823</td><td>P569</td><td>^00000001900-02-29T00:00:00Z/11</td><td>P569</td><td>^1900-03-13T00:00:00Z/11</td><td>[1900, 2, 29, 0, 0, 0, 11]</td><td>[1900, 3, 13, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>6073</th><td>Q481471</td><td>P569</td><td>^00000001762-02-29T00:00:00Z/11</td><td>P569</td><td>^1762-02-28T00:00:00Z/11</td><td>[1762, 2, 29, 0, 0, 0, 11]</td><td>[1762, 2, 28, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>6233</th><td>Q16097212</td><td>P569</td><td>^00000001935-06-31T00:00:00Z/11</td><td>P569</td><td>^1935-01-01T00:00:00Z/9</td><td>[1935, 6, 31, 0, 0, 0, 11]</td><td>[1935, 1, 1, 0, 0, 0, 9]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>61707</th><td>Q10717720</td><td>P576</td><td>^00000001995-06-31T00:00:00Z/11</td><td>P576</td><td>^1995-06-31T00:00:00Z/11</td><td>[1995, 6, 31, 0, 0, 0, 11]</td><td>[1995, 6, 31, 0, 0, 0, 11]</td><td>False</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>4653389</th><td>Q27267640</td><td>P569</td><td>^1989-02-29T00:00:00Z/11</td><td>P569</td><td>^1989-02-00T00:00:00Z/10</td><td>[1989, 2, 29, 0, 0, 0, 11]</td><td>[1989, 2, 0, 0, 0, 0, 10]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>4674014</th><td>Q2379398</td><td>P569</td><td>^1518-04-31T00:00:00Z/11</td><td>P569</td><td>^1518-05-01T00:00:00Z/11</td><td>[1518, 4, 31, 0, 0, 0, 11]</td><td>[1518, 5, 1, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>4674015</th><td>Q2379398</td><td>P569</td><td>^1518-04-31T00:00:00Z/11</td><td>P569</td><td>^1518-00-00T00:00:00Z/9</td><td>[1518, 4, 31, 0, 0, 0, 11]</td><td>[1518, 0, 0, 0, 0, 0, 9]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>4679134</th><td>Q10932215</td><td>P569</td><td>^1938-02-30T00:00:00Z/11</td><td>P569</td><td>^1938-02-00T00:00:00Z/10</td><td>[1938, 2, 30, 0, 0, 0, 11]</td><td>[1938, 2, 0, 0, 0, 0, 10]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>4684514</th><td>Q6447447</td><td>P570</td><td>^1875-02-29T00:00:00Z/11</td><td>P570</td><td>^1875-02-05T00:00:00Z/11</td><td>[1875, 2, 29, 0, 0, 0, 11]</td><td>[1875, 2, 5, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>186 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" node1 label node2 newNode2Label \\\n",
"5950 Q1285220 P570 ^00000001979-02-29T00:00:00Z/11 P570 \n",
"5973 Q165823 P569 ^00000001900-02-29T00:00:00Z/11 P569 \n",
"6073 Q481471 P569 ^00000001762-02-29T00:00:00Z/11 P569 \n",
"6233 Q16097212 P569 ^00000001935-06-31T00:00:00Z/11 P569 \n",
"61707 Q10717720 P576 ^00000001995-06-31T00:00:00Z/11 P576 \n",
"... ... ... ... ... \n",
"4653389 Q27267640 P569 ^1989-02-29T00:00:00Z/11 P569 \n",
"4674014 Q2379398 P569 ^1518-04-31T00:00:00Z/11 P569 \n",
"4674015 Q2379398 P569 ^1518-04-31T00:00:00Z/11 P569 \n",
"4679134 Q10932215 P569 ^1938-02-30T00:00:00Z/11 P569 \n",
"4684514 Q6447447 P570 ^1875-02-29T00:00:00Z/11 P570 \n",
"\n",
" newNode2 parsed_date \\\n",
"5950 ^1979-03-29T00:00:00Z/11 [1979, 2, 29, 0, 0, 0, 11] \n",
"5973 ^1900-03-13T00:00:00Z/11 [1900, 2, 29, 0, 0, 0, 11] \n",
"6073 ^1762-02-28T00:00:00Z/11 [1762, 2, 29, 0, 0, 0, 11] \n",
"6233 ^1935-01-01T00:00:00Z/9 [1935, 6, 31, 0, 0, 0, 11] \n",
"61707 ^1995-06-31T00:00:00Z/11 [1995, 6, 31, 0, 0, 0, 11] \n",
"... ... ... \n",
"4653389 ^1989-02-00T00:00:00Z/10 [1989, 2, 29, 0, 0, 0, 11] \n",
"4674014 ^1518-05-01T00:00:00Z/11 [1518, 4, 31, 0, 0, 0, 11] \n",
"4674015 ^1518-00-00T00:00:00Z/9 [1518, 4, 31, 0, 0, 0, 11] \n",
"4679134 ^1938-02-00T00:00:00Z/10 [1938, 2, 30, 0, 0, 0, 11] \n",
"4684514 ^1875-02-05T00:00:00Z/11 [1875, 2, 29, 0, 0, 0, 11] \n",
"\n",
" parsed_date2 valid_date same_date str_same_date \n",
"5950 [1979, 3, 29, 0, 0, 0, 11] False False False \n",
"5973 [1900, 3, 13, 0, 0, 0, 11] False False False \n",
"6073 [1762, 2, 28, 0, 0, 0, 11] False False False \n",
"6233 [1935, 1, 1, 0, 0, 0, 9] False False False \n",
"61707 [1995, 6, 31, 0, 0, 0, 11] False True False \n",
"... ... ... ... ... \n",
"4653389 [1989, 2, 0, 0, 0, 0, 10] False False False \n",
"4674014 [1518, 5, 1, 0, 0, 0, 11] False False False \n",
"4674015 [1518, 0, 0, 0, 0, 0, 9] False False False \n",
"4679134 [1938, 2, 0, 0, 0, 0, 10] False False False \n",
"4684514 [1875, 2, 5, 0, 0, 0, 11] False False False \n",
"\n",
"[186 rows x 10 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_df[~date_df['valid_date']]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "seventh-sister",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th><th>parsed_date</th><th>parsed_date2</th><th>valid_date</th><th>same_date</th><th>str_same_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>116</th><td>Q12260242</td><td>P569</td><td>^00000001964-00-00T00:00:00Z/9</td><td>P569</td><td>^1964-00-00T00:00:00Z/9</td><td>[1964, 0, 0, 0, 0, 0, 9]</td><td>[1964, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>134</th><td>Q12352405</td><td>P569</td><td>^00000001987-03-19T00:00:00Z/11</td><td>P569</td><td>^1987-03-19T00:00:00Z/11</td><td>[1987, 3, 19, 0, 0, 0, 11]</td><td>[1987, 3, 19, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>273</th><td>Q16506839</td><td>P569</td><td>^00000001718-01-01T00:00:00Z/9</td><td>P569</td><td>^1718-01-01T00:00:00Z/9</td><td>[1718, 1, 1, 0, 0, 0, 9]</td><td>[1718, 1, 1, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>291</th><td>Q1686296</td><td>P571</td><td>^00000002013-01-01T00:00:00Z/11</td><td>P571</td><td>^2013-01-01T00:00:00Z/11</td><td>[2013, 1, 1, 0, 0, 0, 11]</td><td>[2013, 1, 1, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>390</th><td>Q258257</td><td>P569</td><td>^00000001140-00-00T00:00:00Z/9</td><td>P569</td><td>^1140-00-00T00:00:00Z/9</td><td>[1140, 0, 0, 0, 0, 0, 9]</td><td>[1140, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>4711728</th><td>Q99767269</td><td>P569</td><td>^1980-06-11T00:00:00Z/11</td><td>P569</td><td>^1980-06-11T00:00:00Z/11</td><td>[1980, 6, 11, 0, 0, 0, 11]</td><td>[1980, 6, 11, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n",
"    <tr><th>4711729</th><td>Q99824424</td><td>P569</td><td>^1998-02-10T00:00:00Z/11</td><td>P569</td><td>^1998-02-10T00:00:00Z/11</td><td>[1998, 2, 10, 0, 0, 0, 11]</td><td>[1998, 2, 10, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n",
"    <tr><th>4711730</th><td>Q99858723</td><td>P570</td><td>^1908-01-01T00:00:00Z/9</td><td>P570</td><td>^1908-01-01T00:00:00Z/9</td><td>[1908, 1, 1, 0, 0, 0, 9]</td><td>[1908, 1, 1, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>True</td></tr>\n",
"    <tr><th>4711731</th><td>Q99859256</td><td>P569</td><td>^1976-12-03T00:00:00Z/11</td><td>P569</td><td>^1976-12-03T00:00:00Z/11</td><td>[1976, 12, 3, 0, 0, 0, 11]</td><td>[1976, 12, 3, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n",
"    <tr><th>4711732</th><td>Q99945100</td><td>P571</td><td>^2015-00-00T00:00:00Z/9</td><td>P571</td><td>^2015-00-00T00:00:00Z/9</td><td>[2015, 0, 0, 0, 0, 0, 9]</td><td>[2015, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2912668 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" node1 label node2 newNode2Label \\\n",
"116 Q12260242 P569 ^00000001964-00-00T00:00:00Z/9 P569 \n",
"134 Q12352405 P569 ^00000001987-03-19T00:00:00Z/11 P569 \n",
"273 Q16506839 P569 ^00000001718-01-01T00:00:00Z/9 P569 \n",
"291 Q1686296 P571 ^00000002013-01-01T00:00:00Z/11 P571 \n",
"390 Q258257 P569 ^00000001140-00-00T00:00:00Z/9 P569 \n",
"... ... ... ... ... \n",
"4711728 Q99767269 P569 ^1980-06-11T00:00:00Z/11 P569 \n",
"4711729 Q99824424 P569 ^1998-02-10T00:00:00Z/11 P569 \n",
"4711730 Q99858723 P570 ^1908-01-01T00:00:00Z/9 P570 \n",
"4711731 Q99859256 P569 ^1976-12-03T00:00:00Z/11 P569 \n",
"4711732 Q99945100 P571 ^2015-00-00T00:00:00Z/9 P571 \n",
"\n",
" newNode2 parsed_date \\\n",
"116 ^1964-00-00T00:00:00Z/9 [1964, 0, 0, 0, 0, 0, 9] \n",
"134 ^1987-03-19T00:00:00Z/11 [1987, 3, 19, 0, 0, 0, 11] \n",
"273 ^1718-01-01T00:00:00Z/9 [1718, 1, 1, 0, 0, 0, 9] \n",
"291 ^2013-01-01T00:00:00Z/11 [2013, 1, 1, 0, 0, 0, 11] \n",
"390 ^1140-00-00T00:00:00Z/9 [1140, 0, 0, 0, 0, 0, 9] \n",
"... ... ... \n",
"4711728 ^1980-06-11T00:00:00Z/11 [1980, 6, 11, 0, 0, 0, 11] \n",
"4711729 ^1998-02-10T00:00:00Z/11 [1998, 2, 10, 0, 0, 0, 11] \n",
"4711730 ^1908-01-01T00:00:00Z/9 [1908, 1, 1, 0, 0, 0, 9] \n",
"4711731 ^1976-12-03T00:00:00Z/11 [1976, 12, 3, 0, 0, 0, 11] \n",
"4711732 ^2015-00-00T00:00:00Z/9 [2015, 0, 0, 0, 0, 0, 9] \n",
"\n",
" parsed_date2 valid_date same_date str_same_date \n",
"116 [1964, 0, 0, 0, 0, 0, 9] True True False \n",
"134 [1987, 3, 19, 0, 0, 0, 11] True True False \n",
"273 [1718, 1, 1, 0, 0, 0, 9] True True False \n",
"291 [2013, 1, 1, 0, 0, 0, 11] True True False \n",
"390 [1140, 0, 0, 0, 0, 0, 9] True True False \n",
"... ... ... ... ... \n",
"4711728 [1980, 6, 11, 0, 0, 0, 11] True True True \n",
"4711729 [1998, 2, 10, 0, 0, 0, 11] True True True \n",
"4711730 [1908, 1, 1, 0, 0, 0, 9] True True True \n",
"4711731 [1976, 12, 3, 0, 0, 0, 11] True True True \n",
"4711732 [2015, 0, 0, 0, 0, 0, 9] True True True \n",
"\n",
"[2912668 rows x 10 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_df[date_df['same_date']]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "failing-mileage",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No. of deleted statements having exact same date in dataset as of 7th December 2020: 30262\n"
]
}
],
"source": [
"print(f\"No. of deleted statements having exact same date in dataset as of 7th December 2020: {sum(date_df['str_same_date'])}\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "clean-canon",
"metadata": {},
"outputs": [],
"source": [
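"import datetime\n",
"\n",
"def customTimeDelta(date1, date2):\n",
"    # Difference between two parsed dates of the form [Y, M, D, h, m, s, precision].\n",
"    # Returns None when either date is missing or cannot be represented by datetime.\n",
"    if date1 is None or date2 is None:\n",
"        return None\n",
"    try:\n",
"        # sys.maxint does not exist in Python 3, and the old bare except hid the resulting\n",
"        # AttributeError, so every row came back as None. datetime validates the range itself.\n",
"        date1 = datetime.datetime(*date1[:-1])\n",
"        date2 = datetime.datetime(*date2[:-1])\n",
"        return date1 - date2\n",
"    except (OverflowError, TypeError, ValueError):\n",
"        # e.g. month or day 0, or a year outside datetime.MINYEAR..datetime.MAXYEAR\n",
"        return None"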
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "waiting-thumbnail",
"metadata": {},
"outputs": [],
"source": [
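"# .copy() so that adding the time_delta column later does not write to a slice of date_df\n",
"date_df1 = date_df[(date_df['valid_date'] == True) & (date_df['same_date'] == False)].copy()"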
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "superior-gothic",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
":1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)\n"
]
}
],
"source": [
"date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "muslim-stephen",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 None\n",
"1 None\n",
"2 None\n",
"3 None\n",
"4 None\n",
" ... \n",
"4711659 None\n",
"4711682 None\n",
"4711690 None\n",
"4711700 None\n",
"4711703 None\n",
"Name: time_delta, Length: 1798925, dtype: object"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"date_df1['time_delta']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dutch-projection",
"metadata": {},
"outputs": [],
"source": [
"# !head ../../opAnalysis/removed_statements_both_nonredirects_new_vals_date_measured.tsv"
]
},
{
"cell_type": "markdown",
"id": "relative-tomorrow",
"metadata": {},
"source": [
"### Numeric Values Comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "revolutionary-mistake",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "eight-haven",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4323460 ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "unknown-nirvana",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-09 15:19:10 sqlstore]: IMPORT graph directly into table graph_71 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_qty.tsv ...\n",
"[2021-04-09 15:19:30 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_71_c1.\"node1\", graph_71_c1.\"label\", graph_71_c1.\"node2\", graph_51_c2.\"label\" \"_aLias.node2;newLabel\", graph_51_c2.\"node2\" \"_aLias.node2;newVal\"\n",
" FROM graph_51 AS graph_51_c2, graph_71 AS graph_71_c1\n",
" WHERE graph_51_c2.\"node1\"=graph_71_c1.\"node1\"\n",
" AND (graph_71_c1.\"label\" = graph_51_c2.\"label\")\n",
" PARAS: []\n",
"---------------------------------------------\n",
"[2021-04-09 15:19:30 sqlstore]: CREATE INDEX on table graph_71 column node1 ...\n",
"[2021-04-09 15:19:32 sqlstore]: ANALYZE INDEX on table graph_71 column node1 ...\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/claims.quantity.tsv.gz \\\n",
" --match \"non: (x)-[r]->(y), quantity: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x, r.label, y, s.label as `node2;newLabel`, z as `node2;newVal`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "convertible-softball",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3239699 ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "unlikely-overhead",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "historical-copying",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-09 15:26:38 sqlstore]: IMPORT graph directly into table graph_72 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv ...\n",
"[2021-04-09 15:29:43 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", graph_72_c1.\"node2;newLabel\" \"_aLias.node2;newLabel\", max(graph_72_c1.\"node2;newVal\") \"_aLias.node2;newValue\", count(graph_72_c1.\"node2;newVal\") \"_aLias.node2;branching\"\n",
" FROM graph_72 AS graph_72_c1\n",
" WHERE graph_72_c1.\"node2;newLabel\"=graph_72_c1.\"node2;newLabel\"\n",
" AND graph_72_c1.\"node2;newVal\"=graph_72_c1.\"node2;newVal\"\n",
" GROUP BY graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", \"_aLias.node2;newLabel\"\n",
" PARAS: []\n",
"---------------------------------------------\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv \\\n",
" --match \"(node1)-[r]->(node2{newLabel: newLabel, newVal: newValue})\" \\\n",
" --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "waiting-citizenship",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "unlike-huntington",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
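"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>node1</th>\n",
"      <th>label</th>\n",
"      <th>node2</th>\n",
"      <th>node2;newLabel</th>\n",
"      <th>node2;newValue</th>\n",
"      <th>node2;branching</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>2501639</th>\n",
"      <td>Q999961</td>\n",
"      <td>P1082</td>\n",
"      <td>+17243[+17243,+17243]</td>\n",
"      <td>P1082</td>\n",
"      <td>+8883</td>\n",
"      <td>27</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2501640</th>\n",
"      <td>Q999961</td>\n",
"      <td>P1082</td>\n",
"      <td>+6925</td>\n",
"      <td>P1082</td>\n",
"      <td>+8883</td>\n",
"      <td>27</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2501641</th>\n",
"      <td>Q999961</td>\n",
"      <td>P1082</td>\n",
"      <td>+8653[+8653,+8653]</td>\n",
"      <td>P1082</td>\n",
"      <td>+8883</td>\n",
"      <td>27</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2501642</th>\n",
"      <td>Q999961</td>\n",
"      <td>P2046</td>\n",
"      <td>+23.95Q712226</td>\n",
"      <td>P2046</td>\n",
"      <td>+23.952616Q712226</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2501643</th>\n",
"      <td>Q999988</td>\n",
"      <td>P2046</td>\n",
"      <td>+1000[+1000,+1000]Q81292</td>\n",
"      <td>P2046</td>\n",
"      <td>+1000Q81292</td>\n",
"      <td>1</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"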
],
"text/plain": [
" node1 label node2 node2;newLabel \\\n",
"2501639 Q999961 P1082 +17243[+17243,+17243] P1082 \n",
"2501640 Q999961 P1082 +6925 P1082 \n",
"2501641 Q999961 P1082 +8653[+8653,+8653] P1082 \n",
"2501642 Q999961 P2046 +23.95Q712226 P2046 \n",
"2501643 Q999988 P2046 +1000[+1000,+1000]Q81292 P2046 \n",
"\n",
" node2;newValue node2;branching \n",
"2501639 +8883 27 \n",
"2501640 +8883 27 \n",
"2501641 +8883 27 \n",
"2501642 +23.952616Q712226 1 \n",
"2501643 +1000Q81292 1 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.tail()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "confident-carolina",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n",
"P1733\tP4876\t+1014280\tP4876\t+28977\t1\r\n",
"P2040\tP4876\t+34596\tP4876\t+38623\t1\r\n",
"P2349\tP4876\t+12367\tP4876\t+12500\t3\r\n",
"P2427\tP4876\t+95000\tP4876\t+96793\t4\r\n",
"P2518\tP4876\t+11126\tP4876\t+11145\t1\r\n",
"P2725\tP4876\t+2232\tP4876\t+3907\t1\r\n",
"P2816\tP4876\t+32155\tP4876\t+34149\t2\r\n",
"P3289\tP4876\t+113576\tP4876\t+123199\t1\r\n",
"P3400\tP4876\t+123817\tP4876\t+123817\t4\r\n"
]
}
],
"source": [
"!head ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "adjusted-discretion",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7\n"
]
}
],
"source": [
"import re\n",
"test_str = \"+123817Q\"\n",
"temp = re.search(r'[a-z]', test_str, re.I)\n",
"if temp is not None:\n",
" print(temp.start())\n",
"else:\n",
" print(\"Not found\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "toxic-heart",
"metadata": {},
"outputs": [],
"source": [
"def splitIntoParts(text):\n",
" temp = re.search(r'[a-z]', text, re.I)\n",
" firstAlpha1 = -1 if temp is None else temp.start()\n",
" alpha1 = \"\" if firstAlpha1 == -1 else text[firstAlpha1:]\n",
" text = text if firstAlpha1 == -1 else text[:firstAlpha1]\n",
" \n",
" temp = re.search(r'\\[', text, re.I)\n",
" firstBracket1 = -1 if temp is None else temp.start()\n",
" brack1 = \"\" if firstBracket1 == -1 else text[firstBracket1:]\n",
" \n",
" num1 = text if firstBracket1 == -1 else text[:firstBracket1]\n",
" \n",
" return num1, brack1, alpha1"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "impressed-monthly",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('+1234', '[+1, -1]', 'Q12345')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"splitIntoParts(\"+1234[+1, -1]Q12345\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "sunset-fraction",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c86b1765daec4bc084f0c0f399a69dfd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/2501645 [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0mval1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0mval2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"import re\n",
"import rltk\n",
"from rltk.similarity import levenshtein_distance as ld\n",
"from nltk.tokenize import word_tokenize as wt\n",
"from tqdm.notebook import tqdm\n",
"\n",
"def is_num(string):\n",
"    try:\n",
"        float(string)\n",
"        return True\n",
"    except ValueError:\n",
"        return False\n",
"\n",
"# splitlines() does not produce the trailing empty string that read().split(\"\\n\") does;\n",
"# that empty last entry is what raised the IndexError on the final iteration\n",
"with open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\", \"r\") as f:\n",
"    f1 = f.read().splitlines()\n",
"\n",
"with open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\", \"w\") as fNum:\n",
"    # Header plus four flags: number differs, range differs, both differ, unit differs\n",
"    fNum.write(f1[0] + \"\\tNumNE\\tRangeNE\\tNumNRangeNE\\tUnitNE\\n\")\n",
"\n",
"    for line in tqdm(f1[1:]):\n",
"        fields = line.split(\"\\t\")\n",
"        if len(fields) < 5:\n",
"            continue  # skip blank or malformed lines\n",
"        val1, val2 = fields[2], fields[4]\n",
"\n",
"        # Split the removed value and the current value into number, range and unit parts\n",
"        num1, brack1, alpha1 = splitIntoParts(val1)\n",
"        num2, brack2, alpha2 = splitIntoParts(val2)\n",
"\n",
"        fNum.write(line + \"\\t\" + str(num1 != num2) + \"\\t\" + str(brack1 != brack2) + \"\\t\" + str((num1 != num2) and (brack1 != brack2)) + \"\\t\" + str(alpha1 != alpha2) + \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "continued-landscape",
"metadata": {},
"outputs": [],
"source": [
"# from dateutil.parser import parse\n",
"# import re\n",
"# import rltk\n",
"# from rltk.similarity import levenshtein_distance as ld\n",
"# from nltk.tokenize import word_tokenize as wt\n",
"# from tqdm.notebook import tqdm\n",
"\n",
"# def is_num(string):\n",
"# try: \n",
"# float(string)\n",
"# return True\n",
"\n",
"# except ValueError:\n",
"# return False\n",
" \n",
"# f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals.tsv\",\"r\").read().split(\"\\n\")\n",
"# fNum = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_measured.tsv\",\"w\")\n",
"\n",
"# firstLine = f1[0]\n",
"\n",
"# fNum.write(firstLine+\"\\tDiff\\tLevDist\\n\")\n",
"# # fnonQnd.write(f1[0]+\"\\n\")\n",
"\n",
"# for i in tqdm(range(1,len(f1))):\n",
"# line = f1[i]\n",
"# val1 = line.split(\"\\t\")[2]\n",
"# val2 = line.split(\"\\t\")[4]\n",
"# if is_num(val2):\n",
"# diff = float(val2) - float(val1)\n",
"# fNum.write(line+ \"\\t\" + str(diff) + \"\\tNone\\n\")\n",
"# else:\n",
"# LevDist = ld(val1,val2)\n",
"# fNum.write(line+ \"\\tNone\\t\" + str(LevDist) + \"\\n\")\n",
"\n",
"# fNum.close()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "impaired-venue",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"num_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "strange-alcohol",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
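"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>node1</th>\n",
"      <th>label</th>\n",
"      <th>node2</th>\n",
"      <th>node2;newLabel</th>\n",
"      <th>node2;newValue</th>\n",
"      <th>node2;branching</th>\n",
"      <th>NumNE</th>\n",
"      <th>RangeNE</th>\n",
"      <th>NumNRangeNE</th>\n",
"      <th>UnitNE</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>P1733</td>\n",
"      <td>P4876</td>\n",
"      <td>+1014280</td>\n",
"      <td>P4876</td>\n",
"      <td>+28977</td>\n",
"      <td>1</td>\n",
"      <td>True</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>P2040</td>\n",
"      <td>P4876</td>\n",
"      <td>+34596</td>\n",
"      <td>P4876</td>\n",
"      <td>+38623</td>\n",
"      <td>1</td>\n",
"      <td>True</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>P2349</td>\n",
"      <td>P4876</td>\n",
"      <td>+12367</td>\n",
"      <td>P4876</td>\n",
"      <td>+12500</td>\n",
"      <td>3</td>\n",
"      <td>True</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>P2427</td>\n",
"      <td>P4876</td>\n",
"      <td>+95000</td>\n",
"      <td>P4876</td>\n",
"      <td>+96793</td>\n",
"      <td>4</td>\n",
"      <td>True</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>P2518</td>\n",
"      <td>P4876</td>\n",
"      <td>+11126</td>\n",
"      <td>P4876</td>\n",
"      <td>+11145</td>\n",
"      <td>1</td>\n",
"      <td>True</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"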
],
"text/plain": [
" node1 label node2 node2;newLabel node2;newValue node2;branching \\\n",
"0 P1733 P4876 +1014280 P4876 +28977 1 \n",
"1 P2040 P4876 +34596 P4876 +38623 1 \n",
"2 P2349 P4876 +12367 P4876 +12500 3 \n",
"3 P2427 P4876 +95000 P4876 +96793 4 \n",
"4 P2518 P4876 +11126 P4876 +11145 1 \n",
"\n",
" NumNE RangeNE NumNRangeNE UnitNE \n",
"0 True False False False \n",
"1 True False False False \n",
"2 True False False False \n",
"3 True False False False \n",
"4 True False False False "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "hindu-merit",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "hollywood-boring",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 2.501575e+06\n",
"mean 6.733284e+01\n",
"std 5.003042e+02\n",
"min 1.000000e+00\n",
"25% 1.000000e+00\n",
"50% 2.000000e+00\n",
"75% 1.100000e+01\n",
"max 2.132100e+04\n",
"Name: node2;branching, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_df['node2;branching'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "moral-history",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 2501575 quantities, there are 1496454 cases where numbers have got updated, 2037283 cases where ranges have got updated, 1069289 cases where number and range both have got updated, 78048 cases where the unit has got updated\n"
]
}
],
"source": [
"print(f\"Out of {len(num_df)} quantities, there are {num_df['NumNE'].sum()} cases where numbers have got updated, {num_df['RangeNE'].sum()} cases where ranges have got updated, {num_df['NumNRangeNE'].sum()} cases where number and range both have got updated, {num_df['UnitNE'].sum()} cases where the unit has got updated\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "assured-recipient",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"^C\r\n"
]
}
],
"source": [
"# !kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
"# ../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n",
"# --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n",
"# --where \"r.label = s.label\" \\\n",
"# --return 'x, r.label, y, s.label as node2;newLabl, z as node2;nw' \\\n",
"# -o ../../opAnalysis/removed_statements_both_nonredirects_new_vals.tsv"
]
},
{
"cell_type": "markdown",
"id": "muslim-dryer",
"metadata": {},
"source": [
"### Qnodes comparison"
]
},
{
"cell_type": "markdown",
"id": "brilliant-picnic",
"metadata": {},
"source": [
"#### Qnodes type segregation\n",
"\n",
"For each removed Qnode-to-Qnode statement, we count:\n",
"* how many statements have a node1 that is an instance (P31) of something, a subclass (P279) of something, or both\n",
"* how many statements have a node2 that is an instance (P31) of something, a subclass (P279) of something, or both"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "described-america",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "universal-surprise",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "elder-tissue",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "killing-emphasis",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "answering-sheriff",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "intimate-sullivan",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "surprising-clone",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"15682364 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "innovative-thread",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 3500869 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv\n",
" 3396316 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv\n",
" 14206459 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv\n",
" 21103644 total\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1*"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "accompanied-lighting",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 10064419 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv\n",
" 6622159 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv\n",
" 12057758 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv\n",
" 28744336 total\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2*"
]
},
{
"cell_type": "markdown",
"id": "verified-vegetable",
"metadata": {},
"source": [
"#### Qnodes to Qnodes (instance/subclass analysis)\n",
"\n",
"Here we analyze how many P31 relations were deleted and, of those, how many were updated to P31, to P279, or to nothing. We do the same for the P279 relations that were deleted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "quick-welsh",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'count(a)' \\\n",
" --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "satisfied-philosophy",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "southern-daisy",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P279]->(b)' \\\n",
" --return 'count(a)' \\\n",
" --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "subtle-tract",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P279]->(b)' \\\n",
" --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv"
]
},
{
"cell_type": "markdown",
"id": "opponent-bible",
"metadata": {},
"source": [
"##### Analyze for P31 relations"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "soviet-liverpool",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "imposed-pound",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "provincial-limit",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dynamic-persian",
"metadata": {},
"outputs": [],
"source": [
"!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n",
" ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "material-routine",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "aboriginal-injection",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3611396 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n",
"2864334 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n",
"150123 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n",
"106540 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n",
"703480 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv"
]
},
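{
"cell_type": "markdown",
"id": "partition-check-p31",
"metadata": {},
"source": [
"As a sanity check, the counts above should satisfy inclusion-exclusion: rows with a new P31 or a new P279 edge, plus the `nothingnew` rows, should add up to the full extract. A minimal sketch, assuming each file carries exactly one header line:\n",
"\n",
"```python\n",
"# Line counts copied from the wc -l output above.\n",
"HEADER = 1  # assumption: one header line per KGTK file\n",
"total = 3611396 - HEADER\n",
"new_p31 = 2864334 - HEADER\n",
"new_p279 = 150123 - HEADER\n",
"new_both = 106540 - HEADER\n",
"nothing_new = 703480 - HEADER\n",
"\n",
"# Inclusion-exclusion: rows whose node1 has a P31 or a P279 edge in the newer dump.\n",
"new_either = new_p31 + new_p279 - new_both\n",
"\n",
"# The remainder should be exactly the nothingnew file.\n",
"assert total - new_either == nothing_new\n",
"print(f'{nothing_new / total:.1%} of the extract has no new P31/P279 edge on node1')\n",
"```"
]
},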
{
"cell_type": "code",
"execution_count": null,
"id": "perceived-hopkins",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n",
" --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "antique-neighborhood",
"metadata": {},
"outputs": [],
"source": [
"# Whatever is in nothingnew but not in the deleted subset is written to the existing subset\n",
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "alleged-destiny",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 626925 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv\r\n",
" 76556 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv\r\n",
" 703481 total\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_*"
]
},
{
"cell_type": "markdown",
"id": "opposed-palmer",
"metadata": {},
"source": [
"##### Analyze for P279 relations"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "hybrid-hacker",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "reliable-ontario",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "radio-bumper",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "loving-switzerland",
"metadata": {},
"outputs": [],
"source": [
"!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n",
" ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "prostate-trace",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "subsequent-recovery",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"935667 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n",
"865917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n",
"454917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n",
"421734 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n",
"36568 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "hazardous-liberal",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n",
" --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "manual-embassy",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "determined-wonder",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 35004 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv\r\n",
" 1565 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv\r\n",
" 36569 total\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_*"
]
},
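{
"cell_type": "markdown",
"id": "split-check-deleted-existing",
"metadata": {},
"source": [
"The `deleted` and `existing` files should split each `nothingnew` file exactly. A small check using the line counts reported above, again assuming one header line per file:\n",
"\n",
"```python\n",
"HEADER = 1  # assumption: one header line per KGTK file\n",
"\n",
"# wc -l line counts copied from the outputs above.\n",
"splits = {\n",
"    'P31': {'nothingnew': 703480, 'deleted': 626925, 'existing': 76556},\n",
"    'P279': {'nothingnew': 36568, 'deleted': 35004, 'existing': 1565},\n",
"}\n",
"\n",
"deleted_share = {}\n",
"for prop, lines in splits.items():\n",
"    rows = {name: n - HEADER for name, n in lines.items()}\n",
"    # deleted and existing must partition nothingnew.\n",
"    assert rows['deleted'] + rows['existing'] == rows['nothingnew']\n",
"    deleted_share[prop] = rows['deleted'] / rows['nothingnew']\n",
"    print(f'{prop}: {deleted_share[prop]:.1%} of nothingnew rows are in the deleted subset')\n",
"```"
]
},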
{
"cell_type": "code",
"execution_count": 5,
"id": "hundred-equivalent",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Q12328016-P31-Q12737077-46763b70-0\tQ12328016\tP31\tQ12737077\r\n"
]
}
],
"source": [
"!zgrep -P \"Q12328016\\tP31\" ../../wikidata-20210215/derived.P31.tsv.gz"
]
},
{
"cell_type": "markdown",
"id": "cordless-better",
"metadata": {},
"source": [
"# Deprecated Statements Analysis"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "canadian-broadcast",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-14 17:58:03 sqlstore]: IMPORT graph directly into table graph_75 from /data/wd-correctness/data/deprecated.tsv ...\n",
"[2021-04-14 17:58:36 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT *\n",
" FROM graph_75 AS graph_75_c1\n",
" WHERE (graph_75_c1.\"label\" IN (?))\n",
" PARAS: ['P31']\n",
"---------------------------------------------\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../data/deprecated.tsv \\\n",
" --match '(node1)-[prop]->(node2)' \\\n",
" --where 'prop.label in [\"P31\"]' \\\n",
" -o ../../opAnalysis/deprecated_P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "blank-capital",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3303205 ../../opAnalysis/deprecated_P31.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/deprecated_P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "unique-stevens",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"dep_P31_df = pd.read_csv(\"../../opAnalysis/deprecated_P31.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "alternate-snowboard",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Q67206691 2546256\n",
"Q523 352194\n",
"Q67206785 60055\n",
"Q1931185 43618\n",
"Q318 35768\n",
"Q2247863 21906\n",
"Q13890 17533\n",
"Q46587 16574\n",
"Q6243 13070\n",
"Q2154519 12184\n",
"Q1153690 10092\n",
"Q83373 9998\n",
"Q72802727 9948\n",
"Q1491746 9106\n",
"Q71798532 7641\n",
"Name: node2, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dep_P31_df['node2'].value_counts().head(15)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "coupled-rochester",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-14 18:00:30 query]: SQL Translation:\r\n",
"---------------------------------------------\r\n",
" SELECT *\r\n",
" FROM graph_75 AS graph_75_c1\r\n",
" WHERE (graph_75_c1.\"label\" IN (?))\r\n",
" PARAS: ['P279']\r\n",
"---------------------------------------------\r\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../data/deprecated.tsv \\\n",
" --match '(node1)-[prop]->(node2)' \\\n",
" --where 'prop.label in [\"P279\"]' \\\n",
" -o ../../opAnalysis/deprecated_P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bibliographic-wayne",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"307 ../../opAnalysis/deprecated_P279.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/deprecated_P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "caring-gossip",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"dep_P279_df = pd.read_csv(\"../../opAnalysis/deprecated_P279.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "saving-competition",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Q14659 11\n",
"Q245932 8\n",
"Q27825887 7\n",
"Q21451942 6\n",
"Q1861967 6\n",
"Q1457669 4\n",
"Q58840094 4\n",
"Q3024240 3\n",
"Q26772977 3\n",
"Q387917 3\n",
"Q192089 3\n",
"Q276314 3\n",
"Q152574 2\n",
"Q209363 2\n",
"Q7033037 2\n",
"Name: node2, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dep_P279_df['node2'].value_counts().head(15)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "critical-pendant",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# low_memory=False parses the file in one pass so each column gets a single inferred dtype\n",
"dep_df = pd.read_csv(\"../../data/deprecated.tsv\", sep='\\t', low_memory=False)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "abstract-disclaimer",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"P31 3303204\n",
"P2215 2236125\n",
"P2214 2159860\n",
"P2216 816191\n",
"P2583 461113\n",
"P1090 290549\n",
"P215 273273\n",
"P6879 107265\n",
"P7015 66554\n",
"P881 55717\n",
"Name: label, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dep_df.label.value_counts().head(10)"
]
},
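{
"cell_type": "markdown",
"id": "deprecated-p31-check",
"metadata": {},
"source": [
"The P31 count in the label distribution can be cross-checked against the size of `deprecated_P31.tsv`, and the share of the most frequent `node2` class follows directly. A short sketch from the numbers reported above, assuming the extracted file has one header line:\n",
"\n",
"```python\n",
"HEADER = 1  # assumption: one header line in deprecated_P31.tsv\n",
"p31_file_lines = 3303205   # wc -l of deprecated_P31.tsv\n",
"p31_label_count = 3303204  # P31 entry in the label value_counts\n",
"top_class_count = 2546256  # Q67206691, the most frequent node2 for P31\n",
"\n",
"# The kgtk query extract and the pandas count should agree.\n",
"assert p31_file_lines - HEADER == p31_label_count\n",
"\n",
"top_share = top_class_count / p31_label_count\n",
"print(f'Q67206691 accounts for {top_share:.1%} of deprecated P31 statements')\n",
"```"
]
},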
{
"cell_type": "markdown",
"id": "dramatic-spyware",
"metadata": {},
"source": [
"Fin."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "general-hometown",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "kgtkEnv",
"language": "python",
"name": "kgtkenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "288px"
},
"toc_section_display": true,
"toc_window_display": true
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}