{
"cells": [
{
"cell_type": "markdown",
"id": "statutory-onion",
"metadata": {},
"source": [
"# Understanding Removed Statements Dataset\n",
"\n",
"Source of data: [GDrive | Removed Stataments of Wikidata | Feb 1 2021](https://drive.google.com/file/d/1TQP1rADdvhDjsvBpLzSE9Bx3n73wf-Md/view?usp=sharing)\n",
"\n",
"Steps performed:\n",
"* Divide dataset into 2 halves - redirected and non-redirected. Redirected dataset has either node1 or node2 as redirected. But non-redirected has both node1, node2 not redirected\n",
"\n",
"\n",
"**Summary**\n",
"\n",
"Removed Statements dataset has 76.5M removed statements. Out of these, "
]
},
{
"cell_type": "markdown",
"id": "christian-mounting",
"metadata": {},
"source": [
"## Redirects determination and division of dataset into 2 halves\n",
"\n",
"* Since, redirects dataset was not present, a SPARQL query was run to determine all the redirects existing at the moment. This was done on Feb 19, 2021. This was executed using [SPARQL query](https://query.wikidata.org/). Query run was:\n",
" ```\n",
" SELECT ?old_node\n",
" WHERE {\n",
" ?old_node owl:sameAs ?new_node.\n",
" }\n",
" ```\n",
"* This has few lexemes as well which we don't need. So, I then ran the query:\n",
" ```\n",
" SELECT ?old_node\n",
" WHERE {\n",
" ?old_node owl:sameAs ?new_node.\n",
" ?new_node rdf:type ontolex:LexicalEntry.\n",
" }\n",
" ```\n",
"* After removing the lexemes from the nodes file, a final redirected non-lexemes file was created with data from Feb 19, 2021: `data/SPARQL_redirects_non-lexemes.tsv`.\n",
"* Using this reduced dataset, I was able to determine in the removed_statements.tsv dataset, which nodes have been redirected - `../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`. This has removed statements in which either node1 or node2 is redirected.\n",
"* After this, I am extracting the removed statements not present in this subset meaning it would correspond to all removed statements in neither node1 nor node2 is redirected - `../opAnalysis/removed_statements_both_nonredirects.tsv`\n",
"\n",
"For this, I am using the following set of commands"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "thick-absorption",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "boolean-string",
"metadata": {},
"outputs": [],
"source": [
"# On the basis of SPARQL\n",
"!kgtk ifexists -i ../../data/removed_statements.tsv\\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node1.tsv\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv\\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_basis_node1.tsv\n",
"!kgtk ifexists -i ../../data/removed_statements.tsv\\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node2.tsv\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv\\\n",
" --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys id \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_basis_node2.tsv\n",
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n",
" -o ../../opAnalysis/temp1.tsv\n",
"!kgtk cat -i ../../opAnalysis/temp1.tsv \\\n",
" ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n",
" -o ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv\n",
"!kgtk ifnotexists -i ../../data/removed_statements.tsv\\\n",
" --filter-on ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects.tsv"
]
},
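{
"cell_type": "markdown",
"id": "sketch-redirect-split",
"metadata": {},
"source": [
"The split performed by the `kgtk ifexists` / `ifnotexists` commands above can be sketched on a toy example. This is only a minimal illustration in pandas, not the pipeline actually used: the rows and the `redirect_ids` set are made up, and the real files are likely too large to load this way.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-ins for removed_statements.tsv and the redirect list (made-up rows).\n",
"removed = pd.DataFrame({\n",
"    'node1': ['Q1', 'Q2', 'Q3', 'Q4'],\n",
"    'label': ['P31', 'P17', 'P31', 'P279'],\n",
"    'node2': ['Q5', 'Q6', 'Q2', 'Q7'],\n",
"})\n",
"redirect_ids = {'Q2', 'Q4'}\n",
"\n",
"# A statement counts as redirected if node1 or node2 is a redirect.\n",
"is_redirected = removed['node1'].isin(redirect_ids) | removed['node2'].isin(redirect_ids)\n",
"redirected = removed[is_redirected]\n",
"non_redirected = removed[~is_redirected]\n",
"\n",
"print(redirected)\n",
"print(non_redirected)\n",
"```\n",
"\n",
"The boolean mask and its negation partition the input, which mirrors how `removed_statements_both_nonredirects.tsv` is derived as the complement of the `node1or2` file."
]
},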
{
"cell_type": "markdown",
"id": "committed-volunteer",
"metadata": {},
"source": [
"## P31 edges distribution"
]
},
{
"cell_type": "markdown",
"id": "objective-range",
"metadata": {},
"source": [
"Now, we'll determine in this redirected dataset - `../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`, how many of these are P31 edges and determine more stats on these"
]
},
{
"cell_type": "markdown",
"id": "final-fraud",
"metadata": {},
"source": [
"### For Redirected Removed Statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "analyzed-silicon",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'b, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_redirects_P31_stats1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "smaller-eugene",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" count \n",
" perc \n",
" \n",
" \n",
" parent \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" Q4167836 \n",
" 526207 \n",
" 0.213808 \n",
" \n",
" \n",
" Q17329259 \n",
" 301359 \n",
" 0.122448 \n",
" \n",
" \n",
" Q5 \n",
" 222809 \n",
" 0.090531 \n",
" \n",
" \n",
" Q4167410 \n",
" 108583 \n",
" 0.044119 \n",
" \n",
" \n",
" Q13442814 \n",
" 101156 \n",
" 0.041102 \n",
" \n",
" \n",
" Q7187 \n",
" 88231 \n",
" 0.035850 \n",
" \n",
" \n",
" Q11266439 \n",
" 61007 \n",
" 0.024788 \n",
" \n",
" \n",
" Q4423781 \n",
" 53671 \n",
" 0.021808 \n",
" \n",
" \n",
" Q17143521 \n",
" 51581 \n",
" 0.020958 \n",
" \n",
" \n",
" Q15917122 \n",
" 50642 \n",
" 0.020577 \n",
" \n",
" \n",
" Q486972 \n",
" 49257 \n",
" 0.020014 \n",
" \n",
" \n",
" Q16521 \n",
" 46522 \n",
" 0.018903 \n",
" \n",
" \n",
" Q318 \n",
" 26722 \n",
" 0.010858 \n",
" \n",
" \n",
" Q532 \n",
" 23721 \n",
" 0.009638 \n",
" \n",
" \n",
" Q20900710 \n",
" 23482 \n",
" 0.009541 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count perc\n",
"parent \n",
"Q4167836 526207 0.213808\n",
"Q17329259 301359 0.122448\n",
"Q5 222809 0.090531\n",
"Q4167410 108583 0.044119\n",
"Q13442814 101156 0.041102\n",
"Q7187 88231 0.035850\n",
"Q11266439 61007 0.024788\n",
"Q4423781 53671 0.021808\n",
"Q17143521 51581 0.020958\n",
"Q15917122 50642 0.020577\n",
"Q486972 49257 0.020014\n",
"Q16521 46522 0.018903\n",
"Q318 26722 0.010858\n",
"Q532 23721 0.009638\n",
"Q20900710 23482 0.009541"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_P31_stats1.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "japanese-upgrade",
"metadata": {},
"source": [
"Find unique list of redirected nodes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "former-hudson",
"metadata": {},
"outputs": [],
"source": [
"!kgtk unique -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv --column node1 -o ../../opAnalysis/removed_statements_redirects_basis_node1_nodes_only.tsv"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "circular-heritage",
"metadata": {},
"outputs": [],
"source": [
"!kgtk unique -i ../../opAnalysis/removed_statements_redirects_basis_node2.tsv --column node2 -o ../../opAnalysis/removed_statements_redirects_basis_node2_nodes_only.tsv"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "irish-envelope",
"metadata": {},
"outputs": [],
"source": [
"!kgtk cat -i ../../opAnalysis/removed_statements_redirects_basis_node1_nodes_only.tsv \\\n",
" ../../opAnalysis/removed_statements_redirects_basis_node2_nodes_only.tsv \\\n",
" -o ../../opAnalysis/removed_statements_redirects_nodes_only.tsv"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "bridal-effort",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_redirects_nodes_only.tsv \\\n",
" --match '(node1)-[label]->(node2)' \\\n",
" --return 'node1, label.label, sum(node2)' \\\n",
" -o ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "accomplished-wallpaper",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1864249 ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv"
]
},
{
"cell_type": "markdown",
"id": "suburban-cosmetic",
"metadata": {},
"source": [
"### For non-redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "characteristic-still",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'b, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "subsequent-dutch",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" count \n",
" perc \n",
" \n",
" \n",
" parent \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" Q4167836 \n",
" 368888 \n",
" 0.102453 \n",
" \n",
" \n",
" Q4167410 \n",
" 132403 \n",
" 0.036773 \n",
" \n",
" \n",
" Q5 \n",
" 130252 \n",
" 0.036176 \n",
" \n",
" \n",
" Q571 \n",
" 126883 \n",
" 0.035240 \n",
" \n",
" \n",
" Q11266439 \n",
" 125824 \n",
" 0.034946 \n",
" \n",
" \n",
" Q838948 \n",
" 119928 \n",
" 0.033308 \n",
" \n",
" \n",
" Q486972 \n",
" 108105 \n",
" 0.030025 \n",
" \n",
" \n",
" Q532 \n",
" 106786 \n",
" 0.029658 \n",
" \n",
" \n",
" Q783794 \n",
" 101121 \n",
" 0.028085 \n",
" \n",
" \n",
" Q1539532 \n",
" 78186 \n",
" 0.021715 \n",
" \n",
" \n",
" Q916333 \n",
" 62789 \n",
" 0.017439 \n",
" \n",
" \n",
" Q16521 \n",
" 53402 \n",
" 0.014832 \n",
" \n",
" \n",
" Q7366 \n",
" 45005 \n",
" 0.012499 \n",
" \n",
" \n",
" Q13406463 \n",
" 42582 \n",
" 0.011827 \n",
" \n",
" \n",
" Q18593264 \n",
" 40505 \n",
" 0.011250 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count perc\n",
"parent \n",
"Q4167836 368888 0.102453\n",
"Q4167410 132403 0.036773\n",
"Q5 130252 0.036176\n",
"Q571 126883 0.035240\n",
"Q11266439 125824 0.034946\n",
"Q838948 119928 0.033308\n",
"Q486972 108105 0.030025\n",
"Q532 106786 0.029658\n",
"Q783794 101121 0.028085\n",
"Q1539532 78186 0.021715\n",
"Q916333 62789 0.017439\n",
"Q16521 53402 0.014832\n",
"Q7366 45005 0.012499\n",
"Q13406463 42582 0.011827\n",
"Q18593264 40505 0.011250"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "whole-influence",
"metadata": {},
"source": [
"## Properties Distribution"
]
},
{
"cell_type": "markdown",
"id": "international-conditioning",
"metadata": {},
"source": [
"### For redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "known-moore",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n",
" --match 'o: (a)-[r]->(b)' \\\n",
" --return 'r.label, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_redirects_props_dist.tsv"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "unlikely-default",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" count \n",
" perc \n",
" \n",
" \n",
" parent \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" P31 \n",
" 2381072 \n",
" 0.234921 \n",
" \n",
" \n",
" P17 \n",
" 357286 \n",
" 0.035251 \n",
" \n",
" \n",
" P1433 \n",
" 299464 \n",
" 0.029546 \n",
" \n",
" \n",
" P735 \n",
" 295778 \n",
" 0.029182 \n",
" \n",
" \n",
" P50 \n",
" 268412 \n",
" 0.026482 \n",
" \n",
" \n",
" P2860 \n",
" 243607 \n",
" 0.024035 \n",
" \n",
" \n",
" P625 \n",
" 227779 \n",
" 0.022473 \n",
" \n",
" \n",
" P106 \n",
" 185184 \n",
" 0.018271 \n",
" \n",
" \n",
" P131 \n",
" 183759 \n",
" 0.018130 \n",
" \n",
" \n",
" P21 \n",
" 179069 \n",
" 0.017667 \n",
" \n",
" \n",
" P921 \n",
" 167723 \n",
" 0.016548 \n",
" \n",
" \n",
" P279 \n",
" 162394 \n",
" 0.016022 \n",
" \n",
" \n",
" P1566 \n",
" 160213 \n",
" 0.015807 \n",
" \n",
" \n",
" P684 \n",
" 152695 \n",
" 0.015065 \n",
" \n",
" \n",
" P703 \n",
" 119182 \n",
" 0.011759 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count perc\n",
"parent \n",
"P31 2381072 0.234921\n",
"P17 357286 0.035251\n",
"P1433 299464 0.029546\n",
"P735 295778 0.029182\n",
"P50 268412 0.026482\n",
"P2860 243607 0.024035\n",
"P625 227779 0.022473\n",
"P106 185184 0.018271\n",
"P131 183759 0.018130\n",
"P21 179069 0.017667\n",
"P921 167723 0.016548\n",
"P279 162394 0.016022\n",
"P1566 160213 0.015807\n",
"P684 152695 0.015065\n",
"P703 119182 0.011759"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_props_dist.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "satisfactory-future",
"metadata": {},
"source": [
"### For non-redirected removed statements"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "seasonal-composite",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" --match 'o: (a)-[r]->(b)' \\\n",
" --return 'r.label, count(distinct a)' \\\n",
" -o ../../opAnalysis/removed_statements_nonredirects_props_dist.tsv"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "straight-haiti",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" count \n",
" perc \n",
" \n",
" \n",
" parent \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" P2093 \n",
" 6173393 \n",
" 0.161314 \n",
" \n",
" \n",
" P1476 \n",
" 4238487 \n",
" 0.110754 \n",
" \n",
" \n",
" P31 \n",
" 3327644 \n",
" 0.086953 \n",
" \n",
" \n",
" P569 \n",
" 2011539 \n",
" 0.052563 \n",
" \n",
" \n",
" P625 \n",
" 1494410 \n",
" 0.039050 \n",
" \n",
" \n",
" P577 \n",
" 1116328 \n",
" 0.029170 \n",
" \n",
" \n",
" P234 \n",
" 999522 \n",
" 0.026118 \n",
" \n",
" \n",
" P570 \n",
" 983201 \n",
" 0.025692 \n",
" \n",
" \n",
" P131 \n",
" 927413 \n",
" 0.024234 \n",
" \n",
" \n",
" P364 \n",
" 870224 \n",
" 0.022739 \n",
" \n",
" \n",
" P2044 \n",
" 780870 \n",
" 0.020405 \n",
" \n",
" \n",
" P279 \n",
" 765112 \n",
" 0.019993 \n",
" \n",
" \n",
" P969 \n",
" 732461 \n",
" 0.019140 \n",
" \n",
" \n",
" P356 \n",
" 413439 \n",
" 0.010803 \n",
" \n",
" \n",
" P637 \n",
" 387091 \n",
" 0.010115 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count perc\n",
"parent \n",
"P2093 6173393 0.161314\n",
"P1476 4238487 0.110754\n",
"P31 3327644 0.086953\n",
"P569 2011539 0.052563\n",
"P625 1494410 0.039050\n",
"P577 1116328 0.029170\n",
"P234 999522 0.026118\n",
"P570 983201 0.025692\n",
"P131 927413 0.024234\n",
"P364 870224 0.022739\n",
"P2044 780870 0.020405\n",
"P279 765112 0.019993\n",
"P969 732461 0.019140\n",
"P356 413439 0.010803\n",
"P637 387091 0.010115"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_props_dist.tsv',sep='\\t')\n",
"df1.columns = ['parent','count']\n",
"df1 = df1.sort_values(by=['count'],ascending=False)\n",
"df1 = df1.set_index('parent')\n",
"tot = df1['count'].sum()\n",
"df1['perc'] = df1['count'] / tot\n",
"df1.head(15)"
]
},
{
"cell_type": "markdown",
"id": "martial-friday",
"metadata": {},
"source": [
"# Comparison Removed NR dataset with Qnodes, literals\n",
"\n",
"First, let's split this dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "engaging-salon",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:wikibase\\-item)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qnode.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "closed-toyota",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qty.tsv\n",
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:string)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str.tsv\n",
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:`wikibase-item`)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qnode.tsv\n",
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:time)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date.tsv\n"
]
},
{
"cell_type": "markdown",
"id": "rough-emerald",
"metadata": {},
"source": [
"### String Comparison"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "amateur-effort",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"^C\r\n"
]
}
],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/claims.string.tsv.gz \\\n",
" --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x as `node1`, r.label as `label`, y as `node2`, s.label as `node2;newLabl`, z as `node2;nw`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n",
" --graph-cache ~/temp2.sqlite3.db"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "separate-georgia",
"metadata": {},
"outputs": [],
"source": [
"# !sed -i '1s/.*/node1\\tlabel\\tnode2\\tnode2;newLabl\\tnode2;nw/' removed_statements_both_nonredirects_newSeg_str_new_vals.tsv"
]
},
{
"cell_type": "markdown",
"id": "disturbed-geology",
"metadata": {},
"source": [
"The strings subset has a branching factor of approx 10. i.e. 1 removed statement with string literal has been replaced by around 10 new statements (with same node1-label combination). Doing the same comparisons won't give us much insights. Instead, let's truncate this dataset while retaining just the counts of branching factor from each of these node1-label combinations. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "downtown-alabama",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-12 08:48:21 sqlstore]: IMPORT graph directly into table graph_1 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv ...\n",
"[2021-04-12 09:25:32 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\", graph_1_c1.\"node2;newLabl\" \"_aLias.node2;newLabel\", max(graph_1_c1.\"node2;nw\") \"_aLias.node2;newValue\", count(graph_1_c1.\"node2;nw\") \"_aLias.node2;branching\"\n",
" FROM graph_1 AS graph_1_c1\n",
" WHERE graph_1_c1.\"node2;newLabl\"=graph_1_c1.\"node2;newLabl\"\n",
" AND graph_1_c1.\"node2;nw\"=graph_1_c1.\"node2;nw\"\n",
" GROUP BY graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\", \"_aLias.node2;newLabel\"\n",
" PARAS: []\n",
"---------------------------------------------\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n",
" --match \"(node1)-[r]->(node2{newLabl: newLabel, nw: newValue})\" \\\n",
" --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv \\\n",
" --graph-cache ~/sqlite3_caches/temptrunc.sqlite3.db"
]
},
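{
"cell_type": "markdown",
"id": "sketch-branching-truncate",
"metadata": {},
"source": [
"The truncation step above can be illustrated on a few rows. The following is a small sketch in pandas under stated assumptions: the rows are made up, the grouping omits the `node2;newLabl` column for brevity, and the actual truncation was done with the `kgtk query` command above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy join result: one row per (removed value, current value) pair (made-up rows).\n",
"joined = pd.DataFrame({\n",
"    'node1': ['P1005', 'P1005', 'P1005', 'P1004'],\n",
"    'label': ['P1630', 'P1630', 'P1630', 'P1921'],\n",
"    'node2': ['old-a', 'old-a', 'old-a', 'old-b'],\n",
"    'node2;nw': ['new-1', 'new-2', 'new-3', 'new-4'],\n",
"})\n",
"\n",
"# Keep one row per removed statement: the max new value and the number of new values.\n",
"truncated = (\n",
"    joined.groupby(['node1', 'label', 'node2'], as_index=False)\n",
"    .agg(**{\n",
"        'node2;newValue': ('node2;nw', 'max'),\n",
"        'node2;branching': ('node2;nw', 'count'),\n",
"    })\n",
")\n",
"print(truncated)\n",
"```\n",
"\n",
"Each group collapses to a single row, so the output size depends on the number of removed statements rather than on the size of the join."
]
},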
{
"cell_type": "markdown",
"id": "tropical-cooperation",
"metadata": {},
"source": [
"On this truncated dataset, we will next compute the stats and comparisons. Note: Our original string literals subset of removed statements was around 9 GB. With the join operation with claims, this had increased to 90 GB. We have now truncated this dataset to 778 MB"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "meaning-closure",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14349490 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "crude-denmark",
"metadata": {},
"outputs": [],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured.tsv"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "white-valuation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n",
"P1003\tP1630\thttp://alephnew.bibnat.ro:8991/F?func=find-b&request=$1&find_code=SYS&adjacent=Y&local_base=NLR10\tP1630\t\"http://aleph.bibnat.ro:8991/F/?func=direct&local_base=NLR10&doc_number=$1\"\t1\r\n",
"P1004\tP1921\thttp://musicbrainz.org/$1/place\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n",
"P1004\tP1921\thttps://musicbrainz.org/place/$1\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n",
"P1005\tP1630\thttp://purl.pt/index/geral/aut/PT/$1.html\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n",
"P1005\tP1630\thttp://urn.bn.pt/nca/unimarc-authorities/txt?id=$1\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n",
"P1006\tP1630\thttp://data.bibliotheken.nl/id/thes/p$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n",
"P1006\tP1630\thttp://opc4.kb.nl/DB=1/XMLPRS=Y/PPN?PPN=$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n",
"P1006\tP1630\thttp://opc4.kb.nl/PPN?PPN=$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n",
"P1006\tP1630\thttps://data.bibliotheken.nl/doc/thes/p$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n"
]
}
],
"source": [
"!head ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "successful-singer",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "17865317d0014ed9bed573ef559e6d8c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"0it [00:00, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from dateutil.parser import parse\n",
"import re\n",
"import rltk\n",
"from rltk.similarity import levenshtein_distance as ld\n",
"from nltk.tokenize import word_tokenize as wt\n",
"from tqdm.notebook import tqdm\n",
"\n",
"f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv\",\"r\")\n",
"fStr = open(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\",\"w\")\n",
"\n",
"firstLine = next(f1).rstrip()\n",
"\n",
"fStr.write(firstLine+\"\\tVersionBool\\tRangeBool\\tLevDist\\tRearranged\\tRearrangedFirstNP\\n\")\n",
"\n",
"for line in tqdm(f1):\n",
" line = line.rstrip()\n",
" val1 = line.split(\"\\t\")[2]\n",
" val2 = line.split(\"\\t\")[4]\n",
" val2 = val2[1:-1]\n",
" versionBool = bool(re.fullmatch(\"[\\d\\.]+[\\w\\s\\d]*\",val1))\n",
" rangeBool = bool(re.fullmatch(\"[\\d]+[-|–][\\d]+\",val1))\n",
" LevDist = ld(val1,val2)\n",
" rearranged = set(wt(val1)) == set(wt(val2))\n",
" rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n",
" fStr.write(line+ \"\\t\" + str(versionBool) + \"\\t\" + str(rangeBool) + \"\\t\" + \\\n",
" str(LevDist) + \"\\t\" + str(rearranged) + \"\\t\" + str(rearrangedFirstNP) + \"\\n\")\n",
"\n",
"fStr.close()"
]
},
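{
"cell_type": "markdown",
"id": "sketch-string-measures",
"metadata": {},
"source": [
"To make the per-row measures concrete, here is a small self-contained sketch applied to a few made-up string pairs. It reuses the two regular expressions from the cell above, but swaps in a plain dynamic-programming edit distance for `rltk`'s `levenshtein_distance` and whitespace splitting for `nltk`'s `word_tokenize`, so its results may differ from the real pipeline on punctuation-heavy strings. The helper names `levenshtein` and `measure` are hypothetical.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"VERSION_RE = re.compile(r'[\\d\\.]+[\\w\\s\\d]*')\n",
"RANGE_RE = re.compile(r'[\\d]+[-|–][\\d]+')\n",
"\n",
"\n",
"def levenshtein(a, b):\n",
"    # Row-by-row dynamic programming; stand-in for rltk's levenshtein_distance.\n",
"    prev = list(range(len(b) + 1))\n",
"    for i, ca in enumerate(a, start=1):\n",
"        cur = [i]\n",
"        for j, cb in enumerate(b, start=1):\n",
"            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))\n",
"        prev = cur\n",
"    return prev[-1]\n",
"\n",
"\n",
"def measure(old, new):\n",
"    # Whitespace tokens are an assumption here; the notebook uses word_tokenize.\n",
"    return {\n",
"        'VersionBool': bool(VERSION_RE.fullmatch(old)),\n",
"        'RangeBool': bool(RANGE_RE.fullmatch(old)),\n",
"        'LevDist': levenshtein(old, new),\n",
"        'Rearranged': set(old.split()) == set(new.split()),\n",
"    }\n",
"\n",
"\n",
"print(measure('1.2 beta', '1.3 beta'))\n",
"print(measure('10-20', '10-25'))\n",
"print(measure('Smith John', 'John Smith'))\n",
"```\n",
"\n",
"Only the old value is tested against the version and range patterns, as in the cell above, so these flags describe what kind of string was removed rather than what replaced it."
]
},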
{
"cell_type": "code",
"execution_count": null,
"id": "international-violation",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys label node1 \\\n",
" --filter-keys label node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "tracked-carroll",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1927007651 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "vocational-pound",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"14349490 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "trained-tuning",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched.tsv"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "economic-friday",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\tVersionBool\tRangeBool\tLevDist\tRearranged\tRearrangedFirstNP\r\n",
"P1003\tP1630\thttp://alephnew.bibnat.ro:8991/F?func=find-b&request=$1&find_code=SYS&adjacent=Y&local_base=NLR10\tP1630\t\"http://aleph.bibnat.ro:8991/F/?func=direct&local_base=NLR10&doc_number=$1\"\t1\r\n",
"\tFalse\tFalse\t51\tFalse\tFalse\r\n",
"P1004\tP1921\thttp://musicbrainz.org/$1/place\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n",
"\tFalse\tFalse\t6\tFalse\tFalse\r\n",
"P1004\tP1921\thttps://musicbrainz.org/place/$1\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n",
"\tFalse\tFalse\t1\tFalse\tFalse\r\n",
"P1005\tP1630\thttp://purl.pt/index/geral/aut/PT/$1.html\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n",
"\tFalse\tFalse\t31\tFalse\tFalse\r\n",
"P1005\tP1630\thttp://urn.bn.pt/nca/unimarc-authorities/txt?id=$1\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n"
]
}
],
"source": [
"!head ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured.tsv"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "daily-complexity",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"str_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "otherwise-bones",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" node1 \n",
" label \n",
" node2 \n",
" node2;newLabel \n",
" node2;newValue \n",
" node2;branching \n",
" VersionBool \n",
" RangeBool \n",
" LevDist \n",
" Rearranged \n",
" RearrangedFirstNP \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" P1003 \n",
" P1630 \n",
" http://alephnew.bibnat.ro:8991/F?func=find-b&r... \n",
" P1630 \n",
" http://aleph.bibnat.ro:8991/F/?func=direct&loc... \n",
" 1 \n",
" False \n",
" False \n",
" 51 \n",
" False \n",
" False \n",
" \n",
" \n",
" 1 \n",
" P1004 \n",
" P1921 \n",
" http://musicbrainz.org/$1/place \n",
" P1921 \n",
" http://musicbrainz.org/place/$1 \n",
" 1 \n",
" False \n",
" False \n",
" 6 \n",
" False \n",
" False \n",
" \n",
" \n",
" 2 \n",
" P1004 \n",
" P1921 \n",
" https://musicbrainz.org/place/$1 \n",
" P1921 \n",
" http://musicbrainz.org/place/$1 \n",
" 1 \n",
" False \n",
" False \n",
" 1 \n",
" False \n",
" False \n",
" \n",
" \n",
" 3 \n",
" P1005 \n",
" P1630 \n",
" http://purl.pt/index/geral/aut/PT/$1.html \n",
" P1630 \n",
" http://urn.bn.pt/nca/unimarc-authorities/html?... \n",
" 3 \n",
" False \n",
" False \n",
" 31 \n",
" False \n",
" False \n",
" \n",
" \n",
" 4 \n",
" P1005 \n",
" P1630 \n",
" http://urn.bn.pt/nca/unimarc-authorities/txt?i... \n",
" P1630 \n",
" http://urn.bn.pt/nca/unimarc-authorities/html?... \n",
" 3 \n",
" False \n",
" False \n",
" 3 \n",
" False \n",
" False \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" node1 label node2 \\\n",
"0 P1003 P1630 http://alephnew.bibnat.ro:8991/F?func=find-b&r... \n",
"1 P1004 P1921 http://musicbrainz.org/$1/place \n",
"2 P1004 P1921 https://musicbrainz.org/place/$1 \n",
"3 P1005 P1630 http://purl.pt/index/geral/aut/PT/$1.html \n",
"4 P1005 P1630 http://urn.bn.pt/nca/unimarc-authorities/txt?i... \n",
"\n",
" node2;newLabel node2;newValue \\\n",
"0 P1630 http://aleph.bibnat.ro:8991/F/?func=direct&loc... \n",
"1 P1921 http://musicbrainz.org/place/$1 \n",
"2 P1921 http://musicbrainz.org/place/$1 \n",
"3 P1630 http://urn.bn.pt/nca/unimarc-authorities/html?... \n",
"4 P1630 http://urn.bn.pt/nca/unimarc-authorities/html?... \n",
"\n",
" node2;branching VersionBool RangeBool LevDist Rearranged \\\n",
"0 1 False False 51 False \n",
"1 1 False False 6 False \n",
"2 1 False False 1 False \n",
"3 3 False False 31 False \n",
"4 3 False False 3 False \n",
"\n",
" RearrangedFirstNP \n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "mounted-saint",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"62146"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(str_df[str_df['LevDist'] == 0])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "senior-custom",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"bool(re.fullmatch(\"[\\d\\.]+[\\w\\s\\d]*\",\"http://purl.pt/index/geral/aut/PT/$1.html\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "restricted-locking",
"metadata": {},
"outputs": [],
"source": [
"str_df['node2;branching'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "hundred-entrepreneur",
"metadata": {},
"outputs": [],
"source": [
"str_df['node2;branching'].value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "secret-contest",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14349489"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df['node2;branching'].value_counts().sum()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "editorial-romance",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14349489 updates, 254884 correspond to changes due to version change with average branching factor: 1.7222579683306916\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['VersionBool'].sum()} correspond to changes due to version change with average branching factor: {str_df[str_df['VersionBool'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "social-plenty",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 254884.000000\n",
"mean 3.783427\n",
"std 3.277387\n",
"min 0.000000\n",
"25% 2.000000\n",
"50% 3.000000\n",
"75% 5.000000\n",
"max 209.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['VersionBool'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "promising-hopkins",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14349489 updates, 321953 correspond to changes due to range change with average branching factor: 1.0656493339089868\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['RangeBool'].sum()} correspond to changes due to range change with average branching factor: {str_df[str_df['RangeBool'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "varied-reform",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"count 321953.000000\n",
"mean 2.343702\n",
"std 2.188649\n",
"min 0.000000\n",
"25% 1.000000\n",
"50% 2.000000\n",
"75% 3.000000\n",
"max 47.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['RangeBool'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "annoying-transaction",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 14349489 updates, 234286 correspond to changes due to rearrangement with average branching factor: 3.4882536728613744\n"
]
}
],
"source": [
"print(f\"Out of {len(str_df)} updates, {str_df['Rearranged'].sum()} correspond to changes due to rearrangement with average branching factor: {str_df[str_df['Rearranged'] == True]['node2;branching'].mean()}\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "three-characteristic",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 234286.000000\n",
"mean 2.873257\n",
"std 2.006146\n",
"min 0.000000\n",
"25% 0.000000\n",
"50% 4.000000\n",
"75% 4.000000\n",
"max 56.000000\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df[str_df['Rearranged'] == True].LevDist.describe()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "military-coordinator",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 1.434949e+07\n",
"mean 1.153558e+01\n",
"std 5.467439e+00\n",
"min 0.000000e+00\n",
"25% 9.000000e+00\n",
"50% 1.200000e+01\n",
"75% 1.400000e+01\n",
"max 1.445000e+03\n",
"Name: LevDist, dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str_df.LevDist.describe()"
]
},
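{
"cell_type": "markdown",
"id": "sketch-levenshtein",
"metadata": {},
"source": [
"The `LevDist` column summarised above appears to hold the Levenshtein edit distance between the removed `node2` value and `node2;newValue`. How the column was actually computed is not shown in this notebook, so the following is only a minimal sketch of the standard dynamic-programming definition. The function name `levenshtein` is hypothetical.\n",
"\n",
"```python\n",
"def levenshtein(a, b):\n",
"    # prev[j] is the distance between the processed prefix of a and b[:j]\n",
"    prev = list(range(len(b) + 1))\n",
"    for i, ca in enumerate(a, start=1):\n",
"        curr = [i]\n",
"        for j, cb in enumerate(b, start=1):\n",
"            cost = 0 if ca == cb else 1\n",
"            curr.append(min(prev[j] + 1,          # delete ca\n",
"                            curr[j - 1] + 1,      # insert cb\n",
"                            prev[j - 1] + cost))  # substitute or match\n",
"        prev = curr\n",
"    return prev[-1]\n",
"\n",
"old = 'https://musicbrainz.org/place/$1'\n",
"new = 'http://musicbrainz.org/place/$1'\n",
"print(levenshtein(old, new))\n",
"```\n",
"\n",
"For this pair the sketch prints 1 (one deleted character), which agrees with the `LevDist` value of row 2 in `str_df.head()`."
]
},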
{
"cell_type": "code",
"execution_count": 11,
"id": "european-treat",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.0, 'count v/s Lev edit distances')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAaJ0lEQVR4nO3df5hdVX3v8feHhEDJYPgRO8YkkNCmai7RSqb8KLTOVNTwQ/LcW9qb3BShQtOnNj7eKtUg3IhYW9CLFgGLuZZyhZgRKUKKkdgiU+69CIVUIQQaHCGYREiQQHCAFlO/94+9xuycnplz5mTPnMny83qe8+Tsvdas/T1rzvmcPev8iCICMzPb/x3Q7gLMzKwaDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFAN9tHki6VdFO6fpSkAUkT9mG8zZJOTdc/KumLVdVqeXOgW8vKwbMPYyyW9OWxPu5oiYgfRERHRPw7gKQ+SRfsw3h/HhENf35fj2N5cKBbu50BrG13EWY5cKBnQtJMSbdKelbSc5KuSfsPkHSJpKck7ZD0JUlTUlu3pK0145T/3L9U0s3pZ34saaOkrtR2I3AU8HdpieHDdWp6TNKZpe2Jqb7jBmsD3gHcKelgSTel2l+Q9ICkzhHOwQGSlkv6fhrnZklHpLZvSFpW0/8hSf9liLFOlHRvquUhSd2lttmS/jHNyd8DU0ttsyRFuq2fBH4DuCbN0TVDHOuc9Pt5TtLFNW3l5Zy6czTUcSRdJWmLpBclrZf0GzXj1v3dpva696fU9t70u31e0jpJR6f9kvTZdD97UdIGSccO+0uzakVE2y7A9cAO4JEm+n4W+G66PA680M7ax9MFmAA8lOZoMnAwcEpqey/QDxwDdAC3Ajemtm5ga81Ym4FT0/VLgX8FTk/H+Avgvnp9h6hrBbCqtH0G8Fhp+0Tg2+n6HwJ/BxySjjUfeM0Q49Y9LvAB4D5gBnAQ8AVgdWp7D/D/Sn3nAi8AB9UZZzrwXLrdg086zwGvTe3fBj6TjvGbwI+Bm1LbLCCAiWm7D7hgmDmaCwykcQ5K4+6u+R3c1GiO6h0H+D3gSGAi8CHgGeDgRr9bhr8/LaS4P70pjXsJcG9qexewHjgMUOozrd2Pj5+nS3sPXtyJj6OJQK/5ufcD17d78sbLBTgJeHYwRGra7gLeV9p+A/CT9GDspnGg/0OpbS7wSr2+Q9T1yynsDknbq4AVpfZPAP8jXX8vcC/w5iZub93jAo8Bby9tTyvd1kOBl4CjU9snh7oPAR8hPemV9q0DzqX4q2Q3MLnU9mVaD/QVQG9pezLwKvUDfcg5anSc1Od54C2NfrcN7k/fAM4vbR8AvAwcDfwWxcnWicAB7X5c/Dxe2rrkEhH3ADvL+yT9kqQ705+I/0fSG+v86GJg9ZgUuX+YCTwVEbvrtL0eeKq0/RRFwDW7nPFM6frLwMGSJjbzgxHRTxGy75Z0CHAWRfgNOp096+c3UoRmr6QfSvqUpAObrHHQ0cDX0nLEC+nY/w50RsSPga8Di1LfxRRPMEON8zuD46SxTqF4gng98HxEvFTq/1SdMZr1emDL4EYa97kh+o5ojiRdmJZGdqXbMIXS8hBD/26Huz8dDVxVmpedFGfj0yPiW8A1wLXADkkrJb1muBtv1RqPa+grgfdHxHzgQuDz5ca0Xjcb+FYbahuvtgBHDRG0P6R4EA4aPMPcTnHGeshgg4q32r12BMdt5ruXV1OE50Lg0RTySHodRUD+M0BE/CQiPh4Rc4FfB86kWCYZiS3AaRFxWOlycERsK9ci6SSKZYS7hxnnxppxJkfE5cDTwOGSJpf6HzVMTY3m6GmKAAUgPfEdWXeg4edor+Ok9fIPA78LHB4RhwG7KMK3keHuT1uAP6yZm1+IiHtTjZ9Lj925wK8Af9rE8awi4yrQJXVQ3FG/Kum7FGug02q6LQJuifS2MAPgnyiC4XJJk9OLZy
enttXAn6QX8jqAPwe+ks6+Hqc4KzsjneldQrGO26ztFGvzw+kF3gn8EXufnZ8G3BlR/N0uqUfSvPSk8iLFUslPhxn3wHQ7By8TgeuAT5ZepHutpIWln1lL8eR2GcUcDDX+TRR/VbxL0oQ0frekGRHxFPAg8HFJkySdArx7mDobzdEtwJmSTpE0KdVW93HZYI5qj3MoxRP3s8BESSuAZs+Wh7s/XQdcJOk/pZqmSPqddP3XJJ2Q7ksvUazRD/c7tIqNq0CnqOeFiPjV0uVNNX0W4eWWvaQnt3dTrFn/ANgK/NfUfD3Fn+r3AE9SPMjen35uF/A+4IvANooH4V7vemngL4BL0p/fFw5R29MULyL+OvCVUlPt2xVfRxFuL1Islfxjqnsoa4FXSpdLgauANcA3Jf2Y4gXSE0q1/BvFi8KnsveTS23NWyj+ovgoRSBuoTjTHHy8/Lc07k7gY8CXhqnzKuDs9I6Qz9U51kbgj1M9T1Oscw/1OxhujmqPsw64k+JJ+ymK3/uW/zBiHcPdnyLia8AVFMs+LwKPUDw5Q/GE8b/SbXiKYuno080c06qhdILUvgKkWcAdEXFs2r4X+GxEfFWSKF4Aeii1vZHiTjo72l24tSydTT8DHBMRL7a7HrNctPUMXdJqirO3N0jaKul8YAlwvqSHgI0UZ0qDFlG8I8Bhvn87guLdLQ5zswq1/QzdzMyqMd7W0M3MrEVNvZ94NEydOjVmzZrV0s++9NJLTJ48uXHHcWB/qdV1Vst1Vst17rF+/fofRUT9txe36xNN8+fPj1bdfffdLf/sWNtfanWd1XKd1XKdewAPxnj8pKiZmVXHgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWWibR/93xcbtu3ivOVfr9u2+fIzxrgaM7PxoeEZuqTrJe2Q9EiDfr8mabeks6srz8zMmtXMkssNwILhOqT/EusK4JsV1GRmZi1oGOgRcQ/Ff7U1nPcDfwvsqKIoMzMbuab+g4va/yaupm06xf+H2EPx/1feERG3DDHOUmApQGdn5/ze3t6Wit6xcxfbX6nfNm/6lJbGHC0DAwN0dHS0u4yGXGe1XGe1XOcePT096yOiq15bFS+K/iXwkYj4afFfgA4tIlYCKwG6urqiu7u7pQNevep2rtxQv/TNS1obc7T09fXR6u0cS66zWq6zWq6zOVUEehfF/wAOMBU4XdLuiLitgrHNzKxJ+xzoETF78LqkGyiWXG7b13HNzGxkGga6pNVANzBV0lbgY8CBABFx3ahWZ2ZmTWsY6BGxuNnBIuK8farGzMxa5o/+m5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmWgY6JKul7RD0iNDtC+R9LCkDZLulfSW6ss0M7NGmjlDvwFYMEz7k8DbImIe8AlgZQV1mZnZCE1s1CEi7pE0a5j2e0ub9wEzKqjLzMxGSBHRuFMR6HdExLEN+l0IvDEiLhiifSmwFKCzs3N+b2/viAsG2LFzF9tfqd82b/qUlsYcLQMDA3R0dLS7jIZcZ7VcZ7Vc5x49PT3rI6KrXlvDM/RmSeoBzgdOGapPRKwkLcl0dXVFd3d3S8e6etXtXLmhfumbl7Q25mjp6+uj1ds5llxntVxntVxncyoJdElvBr4InBYRz1UxppmZjcw+v21R0lHArcA5EfH4vpdkZmataHiGLmk10A1MlbQV+BhwIEBEXAesAI4EPi8JYPdQ6ztmZjZ6mnmXy+IG7RcAdV8ENTOzseNPipqZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbm
aWCQe6mVkmHOhmZploGOiSrpe0Q9IjQ7RL0uck9Ut6WNJx1ZdpZmaNNHOGfgOwYJj204A56bIU+Kt9L8vMzEaqYaBHxD3AzmG6LAS+FIX7gMMkTauqQDMza44ionEnaRZwR0QcW6ftDuDyiPi/afsu4CMR8WCdvkspzuLp7Oyc39vb21LRO3buYvsr9dvmTZ/S0pijZWBggI6OjnaX0ZDrrJbrrJbr3KOnp2d9RHTVa5s4qkeuERErgZUAXV1d0d3d3dI4V6+6nSs31C9985LWxhwtfX19tHo7x5LrrJbrrJbrbE4V73LZBswsbc9I+8zMbAxVEehrgPekd7ucCOyKiKcrGNfMzEag4ZKLpNVANzBV0lbgY8CBABFxHbAWOB3oB14Gfn+0ijUzs6E1DPSIWNygPYA/rqwiMzNriT8pamaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2SltdpP0rS3ZK+I+lhSadXX6qZmQ2nYaBLmgBcC5wGzAUWS5pb0+0S4OaIeCuwCPh81YWamdnwmjlDPx7oj4gnIuJVoBdYWNMngNek61OAH1ZXopmZNUMRMXwH6WxgQURckLbPAU6IiGWlPtOAbwKHA5OBUyNifZ2xlgJLATo7O+f39va2VPSOnbvY/kr9tnnTp7Q05mgZGBigo6Oj3WU05Dqr5Tqr5Tr36OnpWR8RXfXaJlZ0jMXADRFxpaSTgBslHRsRPy13ioiVwEqArq6u6O7ubulgV6+6nSs31C9985LWxhwtfX19tHo7x5LrrJbrrJbrbE4zSy7bgJml7RlpX9n5wM0AEfFt4GBgahUFmplZc5oJ9AeAOZJmS5pE8aLnmpo+PwDeDiDpTRSB/myVhZqZ2fAaBnpE7AaWAeuAxyjezbJR0mWSzkrdPgT8gaSHgNXAedFocd7MzCrV1Bp6RKwF1tbsW1G6/ihwcrWlmZnZSPiTomZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSaaCnRJCyRtktQvafkQfX5X0qOSNkr6crVlmplZIxMbdZA0AbgWeAewFXhA0pqIeLTUZw5wEXByRDwv6RdHq2AzM6uvmTP044H+iHgiIl4FeoGFNX3+ALg2Ip4HiIgd1ZZpZmaNKCKG7yCdDSyIiAvS9jnACRGxrNTnNuBx4GRgAnBpRNxZZ6ylwFKAzs7O+b29vS0VvWPnLra/Ur9t3vQpLY05WgYGBujo6Gh3GQ25zmq5zmq5zj16enrWR0RXvbaGSy5NmgjMAbqBGcA9kuZFxAvlThGxElgJ0NXVFd3d3S0d7OpVt3Plhvqlb17S2pijpa+vj1Zv51hyndVyndVync1pZsllGzCztD0j7SvbCqyJiJ9ExJMUZ+tzqinRzMya0UygPwDMkTRb0iRgEbCmps9tFGfnSJoK/ArwRHVlmplZIw0DPSJ2A8uAdcBjwM0RsVHSZZLOSt3WAc9JehS4G/jTiHhutIo2M7P/qKk19IhYC6yt2beidD2AD6aLmZm1gT8pamaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2Slg/T77clhaSu6ko0M7NmNAx0SROAa4HTgLnAYklz6/Q7FPgAcH/VRZqZWWPNnKEfD/RHxBMR8SrQCyys0+8TwBXAv1ZYn5mZNUkRMXwH6WxgQURckLbPAU
6IiGWlPscBF0fEb0vqAy6MiAfrjLUUWArQ2dk5v7e3t6Wid+zcxfZX6rfNmz6lpTFHy8DAAB0dHe0uoyHXWS3XWS3XuUdPT8/6iKi7rD1xXweXdADwGeC8Rn0jYiWwEqCrqyu6u7tbOubVq27nyg31S9+8pLUxR0tfXx+t3s6x5Dqr5Tqr5Tqb08ySyzZgZml7Rto36FDgWKBP0mbgRGCNXxg1MxtbzQT6A8AcSbMlTQIWAWsGGyNiV0RMjYhZETELuA84q96Si5mZjZ6GgR4Ru4FlwDrgMeDmiNgo6TJJZ412gWZm1pym1tAjYi2wtmbfiiH6du97WWZmNlL+pKiZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpaJpgJd0gJJmyT1S1pep/2Dkh6V9LCkuyQdXX2pZmY2nIaBLmkCcC1wGjAXWCxpbk237wBdEfFm4BbgU1UXamZmw2vmDP14oD8inoiIV4FeYGG5Q0TcHREvp837gBnVlmlmZo0oIobvIJ0NLIiIC9L2OcAJEbFsiP7XAM9ExJ/VaVsKLAXo7Oyc39vb21LRO3buYvsr9dvmTZ/S0pijZWBggI6OjnaX0ZDrrJbrrJbr3KOnp2d9RHTVa5tY5YEk/R7QBbytXntErARWAnR1dUV3d3dLx7l61e1cuaF+6ZuXtDbmaOnr66PV2zmWXGe1XGe1XGdzmgn0bcDM0vaMtG8vkk4FLgbeFhH/Vk15ZmbWrGbW0B8A5kiaLWkSsAhYU+4g6a3AF4CzImJH9WWamVkjDQM9InYDy4B1wGPAzRGxUdJlks5K3T4NdABflfRdSWuGGM7MzEZJU2voEbEWWFuzb0Xp+qkV12VmZiPkT4qamWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZaCrQJS2QtElSv6TlddoPkvSV1H6/pFmVV2pmZsNqGOiSJgDXAqcBc4HFkubWdDsfeD4ifhn4LHBF1YWamdnwJjbR53igPyKeAJDUCywEHi31WQhcmq7fAlwjSRERFdbalFnLv153/+bLzxjjSszMxlYzgT4d2FLa3gqcMFSfiNgtaRdwJPCjcidJS4GlaXNA0qZWigam1o7diNr3N8OIa20T11kt11kt17nH0UM1NBPolYmIlcDKfR1H0oMR0VVBSaNuf6nVdVbLdVbLdTanmRdFtwEzS9sz0r66fSRNBKYAz1VRoJmZNaeZQH8AmCNptqRJwCJgTU2fNcC56frZwLfasX5uZvbzrOGSS1oTXwasAyYA10fERkmXAQ9GxBrgr4EbJfUDOylCfzTt87LNGNpfanWd1XKd1XKdTZBPpM3M8uBPipqZZcKBbmaWif0u0Bt9DcEY1zJT0t2SHpW0UdIH0v4jJP29pO+lfw9P+yXpc6n2hyUdN8b1TpD0HUl3pO3Z6asa+tNXN0xK+9v2VQ6SDpN0i6R/kfSYpJPG43xK+pP0O39E0mpJB4+X+ZR0vaQdkh4p7RvxHEo6N/X/nqRz6x1rFOr8dPrdPyzpa5IOK7VdlOrcJOldpf2jmgn16iy1fUhSSJqatts2nwBExH5zoXhR9vvAMcAk4CFgbhvrmQYcl64fCjxO8fUInwKWp/3LgSvS9dOBbwACTgTuH+N6Pwh8Gbgjbd8MLErXrwP+KF1/H3Bdur4I+MoY1vi/gQvS9UnAYeNtPik+SPck8AuleTxvvMwn8JvAccAjpX0jmkPgCOCJ9O/h6frhY1DnO4GJ6foVpTrnpsf7QcDslAMTxiIT6tWZ9s+keLPIU8DUds9nROx3gX4SsK60fRFwUb
vrKtVzO/AOYBMwLe2bBmxK178ALC71/1m/MahtBnAX8FvAHekO96PSg+dnc5vupCel6xNTP41BjVNSUKpm/7iaT/Z8MvqIND93AO8aT/MJzKoJyhHNIbAY+EJp/179RqvOmrb/DKxK1/d6rA/O6VhlQr06Kb7m5C3AZvYEelvnc39bcqn3NQTT21TLXtKf0W8F7gc6I+Lp1PQM0Jmut7P+vwQ+DPw0bR8JvBARu+vUstdXOQCDX+Uw2mYDzwJ/k5aGvihpMuNsPiNiG/A/gR8AT1PMz3rG33yWjXQOx8Nj7b0UZ7sMU09b6pS0ENgWEQ/VNLW1zv0t0MclSR3A3wL/PSJeLLdF8XTc1veGSjoT2BER69tZRxMmUvxp+1cR8VbgJYrlgZ8ZJ/N5OMUX0s0GXg9MBha0s6aRGA9z2Iiki4HdwKp211JL0iHAR4EV7a6l1v4W6M18DcGYknQgRZiviohb0+7tkqal9mnAjrS/XfWfDJwlaTPQS7HschVwmIqvaqitpV1f5bAV2BoR96ftWygCfrzN56nAkxHxbET8BLiVYo7H23yWjXQO2/ZYk3QecCawJD35MEw97ajzlyiezB9Kj6kZwD9Lel2769zfAr2ZryEYM5JE8SnZxyLiM6Wm8lchnEuxtj64/z3plfATgV2lP4NHTURcFBEzImIWxZx9KyKWAHdTfFVDvTrH/KscIuIZYIukN6Rdb6f4muZxNZ8USy0nSjok3QcG6xxX81ljpHO4DninpMPTXyTvTPtGlaQFFEuDZ0XEyzX1L0rvGJoNzAH+iTZkQkRsiIhfjIhZ6TG1leLNEc/Q7vmselF+tC8UryI/TvHK9sVtruUUij9dHwa+my6nU6yP3gV8D/gH4IjUXxT/Wcj3gQ1AVxtq7mbPu1yOoXhQ9ANfBQ5K+w9O2/2p/ZgxrO9XgQfTnN5G8Y6AcTefwMeBfwEeAW6kePfFuJhPYDXF2v5PKMLm/FbmkGINuz9dfn+M6uynWGsefDxdV+p/capzE3Baaf+oZkK9OmvaN7PnRdG2zWdE+KP/Zma52N+WXMzMbAgOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy8f8BVU8FU6OhyzoAAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"str_df.LevDist.hist(bins=50).set_title(\"count v/s Lev edit distances\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "dangerous-civilian",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Text(0.5, 0, 'Levenshtein Distance'), Text(0, 0.5, 'Count (in millions)')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAmEAAAF+CAYAAADKnc2YAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAgNklEQVR4nO3de7RdZX3u8e8jAWnVSG2iwxPAIMW2DC+oMVq1Fqn2xNaC1kvkqK0tyuk4xUPr5TQ9dkQbjz2xttZovUVFtFYBLWpULLaKpUOFEBQUsCqCHLOPRxJvuzcv4O/8sWZ0ZbMvKyFzv3uv9f2Mscde853vWvM3mbh5nO+75puqQpIkSYvrDq0LkCRJmkSGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWpgWYawJOckuTnJNSP2f2qS65Jcm+SdfdcnSZK0kCzH54QleRTwr8Dbq+q+C/Q9AbgAOKWqvpXk7lV182LUKUmSNJdleSesqi4FvjncluT4JH+X5Mok/5Tk57pdzwFeW1Xf6t5rAJMkSc0tyxA2h+3Ac6vqwcALgNd17fcB7pPkE0kuS7KhWYWSJEmdFa0LOBSS3Bl4OPDuJPua79j9XgGcAJwMHA1cmuR+VfXtRS5TkiTpR8YihDG4o/ftqjppln27gcur6gfAjUm+yCCUXbGI9UmSJO1nLIYjq2qaQcB6CkAGHtDtfh+Du2AkWcVgePKGBmVKkiT9yLIMYUneBXwK+Nkku5OcATwdOCPJ1cC1wGld94uBbyS5DrgEeGFVfaNF3ZIkSfssy0dUSJIkLXfL8k6YJEnScmcIkyRJamDZfTty1apVtXbt2tZlSJIkLejKK6/cW1WrZ9u37ELY2rVr2bVrV+syJEmSFpTkprn2ORwpSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ2saF2ANInO3rSZqb3T+7WtWbWSbVu3NKpIkrTYDGFSA1N7p1mxfuP+bTvPb1SNJKkFQ5i0hHnHTJLGlyFMWsK8YyZJ48uJ+ZIkSQ0YwiRJkhowhEmSJDVgCJMkSWqgtxCW5JwkNye5ZoF+D0lyS5In91WLJEnSUtPnnbBzgQ3zdUhyGPBy4CM91iFJkrTk9BbCqupS4JsLdHsu8LfAzX3VIUmStBQ1mxOWZA3wROD1rWqQJElqpeXE/FcBf1hVP1yoY5Izk+xKsmvPnj39VyZJktSzlk/MXweclwRgFfCrSW6pqvfN7FhV24HtAOvWravFLFKSJKkPzUJYVR2373WSc4EPzhbAJEmSxlFvISzJu4CTgVVJdgMvBg4HqKo39HVcSZKk5aC3EFZVpx9A32f1VYckSdJS5BPzJUmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKmBFa0LkJarszdtZmrv9H5ta1atZNvWLY0qkiQtJ4Yw6SBN7Z1mxfqN+7ftPL9RNZKk5cbhSEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ14ALe0pg6e9NmpvZO79e2ZtVKtm3d0qgiSdIwQ5g0pqb2TrNi/cb923ae36gaSdJMDkdKkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA72FsCTnJLk5yTVz7H96ks8m+VySTyZ5QF+1SJIkLTV93gk7F9gwz/4bgV+qqvsBLwW291iLJEnSktLbc8Kq6tIka+fZ/8mhzcuAo/uqRZIkaalZKnPCzgA+3LoISZKkxdL8iflJHs0ghD1ynj5nAmcCHHvssYtUmSRJUn+a3glLcn/gzcBpVfWNufpV1faqWldV61
avXr14BUqSJPWkWQhLcixwIfDMqvpiqzokSZJa6G04Msm7gJOBVUl2Ay8GDgeoqjcAm4GfBl6XBOCWqlrXVz2SJElLSZ/fjjx9gf3PBp7d1/Glg3H2ps1M7Z3er23NqpVs27qlUUWSpHHVfGK+tJRM7Z1mxfqN+7ftPL9RNZKkcbZUHlEhSZI0UQxhkiRJDRjCJEmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaWNG6AOlQO3vTZqb2Tu/XtmbVSrZt3dKoIkmSbssQprEztXeaFes37t+28/xG1UiSNDuHIyVJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqwBAmSZLUQG8hLMk5SW5Ocs0c+5Pk1UmuT/LZJA/qqxZJkqSlps87YecCG+bZ/zjghO7nTOD1PdYiSZK0pPQWwqrqUuCb83Q5DXh7DVwGHJXknn3VI0mStJS0nBO2Bvjq0Pburu02kpyZZFeSXXv27FmU4iRJkvq0LCbmV9X2qlpXVetWr17duhxJkqTbrWUImwKOGdo+umuTJEkaey1D2A7gN7tvST4M+E5Vfa1hPZIkSYtmRV8fnORdwMnAqiS7gRcDhwNU1RuAi4BfBa4H/h347b5qkSRJWmp6C2FVdfoC+wv4vb6OL0mStJQti4n5kiRJ42bBO2FJjgQeD/wi8J+A/wCuAT5UVdf2W54kSdJ4mjeEJfkTBgHs48DlwM3AkcB9gK1dQHt+VX225zolSZLGykJ3wnZW1Yvn2PfKJHcHjj3ENUlaRGdv2szU3un92tasWsm2rVsaVSRJk2HeEFZVH5rZluQOwJ2rarqqbmZwd0zSMjW1d5oV6zfu37bz/EbVSNLkGGlifpJ3JlmZ5E4M5oNdl+SF/ZYmSZI0vkb9duSJVTUNPAH4MHAc8My+ipIkSRp3oz4n7PAkhzMIYX9VVT9IUv2VJd2Wc5ckSeNk1BD2RuArwNXApUnuBUzP+w7pEHPukiRpnIwUwqrq1cCrh5puSvLofkqSJEkafyOFsCR3BJ4ErJ3xHseBJEmSDsKow5HvB74DXAl8r79yJEmSJsOoIezoqtrQayWSJEkTZNRHVHwyyf16rUSSJGmCjHon7JHAs5LcyGA4MkBV1f17q0ySJGmMjRrCHtdrFZIkSRNmpOHIqroJOAr49e7nqK5NkiRJB2HUtSPPBv4GuHv3844kz+2zMEmSpHE26nDkGcBDq+rfAJK8HPgU8Jq+CpMkSRpno347MsCtQ9u3dm2SJEk6CKPeCXsrcHmS93bbTwDe0ktFkiRJE2DUtSNfmeTjDB5VAfDbVfWZ3qqSJEkac/OGsCQrq2o6yd2Ar3Q/+/bdraq+2W95kiRJ42mhO2HvBB7PYM3IGmpPt33vnuqSJEkaa/OGsKp6fPf7uMUpR5IkaTIsNBz5oPn2V9WnD205kiRJk2Gh4ci/mGdfAaccwlokSZImxkLDkY9erEIkSZImyULDkb8x3/6quvDQliNJkjQZFhqO/PV59hVgCJMkSToICw1H/vZiFSJJkjRJFhqOfEZVvSPJ82bbX1Wv7KcsSZKk8bbQcOSdut936bsQSZKkSbLQcOQbu99/sjjlaNydvWkzU3un92tbs2ol27ZuaVSRJEltjLSAd5LjgOcCa4ffU1WnLvC+DcA24DDgzVW1dcb+Y4G3AUd1fTZV1UWjl6/lZmrvNCvWb9y/bef5jaqRJKmdkUIY8D7gLcAHgB+O8oYkhwGvBR4L7AauSLKjqq4b6vbHwAVV9fokJwIXMQh6kiRJY23UEPbdqnr1AX72euD6qroBIMl5wGnAcAgrYGX3+q7A/z3AY0iSJC1Lo4awbUleDHwE+N6+xg
XWjlwDfHVoezfw0Bl9XgJ8JMlzGXwJ4DGzfVCSM4EzAY499tgRS5YkSVq6Rg1h9wOeyWCtyH3DkYdi7cjTgXOr6i+S/ALw10nuW1X7DXlW1XZgO8C6devqdh5TkiSpuVFD2FOAe1fV9w/gs6eAY4a2j+7ahp0BbACoqk8lORJYBdx8AMeRJEladu4wYr9rGHyD8UBcAZyQ5LgkRwBPA3bM6PN/gF8GSPLzwJHAngM8jiRJ0rIz6p2wo4B/TnIF+88Jm/MRFVV1S5KzgIsZPH7inKq6NskWYFdV7QCeD7wpyR8wGN58VlU53ChJksbeqCHsxQfz4d0zvy6a0bZ56PV1wCMO5rMlSZKWs5FCWFX9Y9+FSJIkTZJR54RJkiTpEDKESZIkNWAIkyRJamDUBbwfweDp9vfq3hOgqure/ZUmSZI0vkb9duRbgD8ArgRu7a8cSZKkyTBqCPtOVX2410okSZImyKgh7JIkrwAuZPQFvCVJkjSHUUPYQ7vf64baDsUC3pIkSRNp1Ie1PrrvQiRJkibJvCEsyTOq6h1Jnjfb/qp6ZT9lSZIkjbeF7oTdqft9l74LkSRJmiTzhrCqemP3+08WpxxJkqTJMO8T85P8cZK7zbP/lCSPP/RlSZIkjbeFhiM/B3wgyXeBTwN7gCOBE4CTgH8A/rTPAiVJksbRQsOR7wfen+QE4BHAPYFp4B3AmVX1H/2XKEmSNH5GfUTFl4Av9VyLJEnSxJh3TpgkSZL6YQiTJElqYKThyCSPqKpPLNQmabKcvWkzU3unb9O+ZtVKtm3d0qAiSVo+Rl078jXAg0ZokzRBpvZOs2L9xtu27zy/QTWStLwstGzRLwAPB1bPWLpoJXBYn4VJkiSNs4XuhB0B3LnrN7x00TTw5L6KkiRJGncLPSfsH4F/THJuVd20SDVJkiSNvVHnhN0xyXZg7fB7quqUPoqSJEkad6OGsHcDbwDeDNzaXzmSJEmTYdQQdktVvb7XSiRJkibIqA9r/UCS/5bknknutu+n18okSZLG2Kh3wn6r+/3CobYC7n1oy5EkSZoMoy7gfVzfhUiSJE2SUZct+s3Z2qvq7Ye2HEmSpMkw6nDkQ4ZeHwn8MvBpwBAmSZJ0EEYdjnzu8HaSo4Dz+ihIkiRpEoz67ciZ/g1YcJ5Ykg1JvpDk+iSb5ujz1CTXJbk2yTsPsh5JkqRlZdQ5YR9g8G1IGCzc/fPABQu85zDgtcBjgd3AFUl2VNV1Q31OAP4IeERVfSvJ3Q/8FCRJkpafUeeE/fnQ61uAm6pq9wLvWQ9cX1U3ACQ5DzgNuG6oz3OA11bVtwCq6uYR65EkSVrWRhqO7Bby/mfgLsBPAd8f4W1rgK8Obe/u2obdB7hPkk8kuSzJhlHqkSRJWu5GCmFJngrsBJ4CPBW4PMmTD8HxVwAnACcDpwNv6ib9zzz+mUl2Jdm1Z8+eQ3BYSZKktkYdjnwR8JB9w4VJVgP/ALxnnvdMAccMbR/dtQ3bDVxeVT8AbkzyRQah7IrhTlW1HdgOsG7dukKSJGmZG/XbkXeYMV/rGyO89wrghCTHJTkCeBqwY0af9zG4C0aSVQyGJ28YsSZJkqRla9Q7YX+X5GLgXd32RuDD872hqm5JchZwMYNvVJ5TVdcm2QLsqqod3b5fSXIdcCvwwqr6xsGciCRJ0nIy6sNaX5jkN4BHdk3bq+q9I7zvIuCiGW2bh14X8LzuR5IkaWLMG8KS/Axwj6r6RFVdCFzYtT8yyfFV9eXFKFKSJGncLDSv61XA9Czt3+n2SZIk6SAsFMLuUVWfm9nYta3tpSJJkqQJsFAIO2qefT9xCOuQJEmaKAuFsF1JnjOzMcmzgSv7KUmSJGn8LfTtyN8H3pvk6fw4dK0DjgCe2GNdkiRJY23eEFZVXwcenuTRwH275g9V1cd6r0ySJGmMjfqcsEuAS3quRZIkaWKMumyRJEmSDiFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJD
VgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqYEXrAiRNhrM3bWZq7/R+bWtWrWTb1i2NKpKktgxhkhbF1N5pVqzfuH/bzvMbVSNJ7TkcKUmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhrwERU6JHwGlCRJB8YQpkPCZ0BJknRgHI6UJElqoNcQlmRDki8kuT7Jpnn6PSlJJVnXZz2SJElLRW8hLMlhwGuBxwEnAqcnOXGWfncBzgYu76sWSZKkpabPO2Hrgeur6oaq+j5wHnDaLP1eCrwc+G6PtUiSJC0pfYawNcBXh7Z3d20/kuRBwDFV9aH5PijJmUl2Jdm1Z8+eQ1+pJEnSIms2MT/JHYBXAs9fqG9Vba+qdVW1bvXq1f0XJ0mS1LM+Q9gUcMzQ9tFd2z53Ae4LfDzJV4CHATucnC9JkiZBnyHsCuCEJMclOQJ4GrBj386q+k5VraqqtVW1FrgMOLWqdvVYkyRJ0pLQWwirqluAs4CLgc8DF1TVtUm2JDm1r+NKkiQtB70+Mb+qLgIumtG2eY6+J/dZiyRJ0lLiE/MlSZIaMIRJkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGjCESZIkNWAIkyRJaqDXtSMl6VA4e9NmpvZO79e2ZtVKtm3d0qgiSbr9DGGSlrypvdOsWL9x/7ad5zeqRpIODYcjJUmSGvBO2O3kMIkkSToYhrDbyWESSZJ0MByOlCRJasAQJkmS1IDDkWPOOWuSJC1NhrAx55w1SZKWJocjJUmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAC3gLgLM3bWZq7/Rt2tesWsm2rVsaVCQduNn+PfbfYUlLlSGsoaX0H4ypvdOsWL/xtu07z1/0WqSDNdu/x/47LGmp6jWEJdkAbAMOA95cVVtn7H8e8GzgFmAP8DtVdVOfNS0l/gdDkqTJ1ducsCSHAa8FHgecCJye5MQZ3T4DrKuq+wPvAf6sr3okSZKWkj4n5q8Hrq+qG6rq+8B5wGnDHarqkqr6927zMuDoHuuRJElaMvoMYWuArw5t7+7a5nIG8OHZdiQ5M8muJLv27NlzCEuUJElqY0k8oiLJM4B1wCtm219V26tqXVWtW7169eIWJ0mS1IM+J+ZPAccMbR/dte0nyWOAFwG/VFXf67EeSZKkJaPPO2FXACckOS7JEcDTgB3DHZI8EHgjcGpV3dxjLZIkSUtKbyGsqm4BzgIuBj4PXFBV1ybZkuTUrtsrgDsD705yVZIdc3ycJEnSWOn1OWFVdRFw0Yy2zUOvH9Pn8SVJkpaqJTExX5IkadIYwiRJkhowhEmSJDVgCJMkSWqg14n5krRcnL1pM1N7p/drW7NqJdu2bmlUkaRxZwiTJGBq7zQr1m/cv23n+Y2qkTQJHI6UJElqwDthy4DDJJIkjR9D2DLgMIkkSePH4UhJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqwG9HStIB8JExkg4VQ5gkHQAfGSPpUHE4UpIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhrw25GStIh8xIWkfQxhkrSIfMSFpH0cjpQkSWrAO2FzcMhAkiT1yRA2B4cMJElSnwxhkrQEeTdeGn+GMElagrwbL40/J+ZLkiQ1YAiTJElqwOFISRozzieTlgdDmCSNGeeTScuDIUySJpR3zKS2eg1hSTYA24DDgDdX1dYZ++8IvB14MPANYGNVfaXPmiRJA7fnjpkBTrr9egthSQ4DXgs8FtgNXJFkR1VdN9TtDOBbVfUzSZ4GvBzYeNtPkyQtJaMGuNnCGhjYJOj3Tth64PqqugEgyXnAacBwCDsNeEn3+j3AXyVJVVWPdUmSFs
lsYQ284yZBvyFsDfDVoe3dwEPn6lNVtyT5DvDTwN4e65IkLRO3547bqGFt1PcuxjEW6zjL8RgH8v7lIn3ddEryZGBDVT27234m8NCqOmuozzVdn93d9pe7PntnfNaZwJnd5s8CX+il6P2tYnLDoOc+uSb5/Cf53GGyz99zn1yLcf73qqrVs+3o807YFHDM0PbRXdtsfXYnWQHclcEE/f1U1XZge091zirJrqpat5jHXCo898k8d5js85/kc4fJPn/PfTLPHdqff59PzL8COCHJcUmOAJ4G7JjRZwfwW93rJwMfcz6YJEmaBL3dCevmeJ0FXMzgERXnVNW1SbYAu6pqB/AW4K+TXA98k0FQkyRJGnu9Piesqi4CLprRtnno9XeBp/RZw+2wqMOfS4znPrkm+fwn+dxhss/fc59cTc+/t4n5kiRJmlufc8IkSZI0B0PYDEk2JPlCkuuTbGpdz2JL8pUkn0tyVZJdrevpU5JzktzcPSplX9vdkvx9ki91v3+qZY19muP8X5Jkqrv+VyX51ZY19iXJMUkuSXJdkmuTnN21j/31n+fcJ+XaH5lkZ5Kru/P/k679uCSXd3/7z+++UDZW5jn3c5PcOHTtT2pcam+SHJbkM0k+2G03ve6GsCFDSy09DjgROD3JiW2rauLRVXXSBHxt+Vxgw4y2TcBHq+oE4KPd9rg6l9ueP8Bfdtf/pG5e5zi6BXh+VZ0IPAz4ve5/65Nw/ec6d5iMa/894JSqegBwErAhycMYLJv3l1X1M8C3GCyrN27mOneAFw5d+6taFbgIzgY+P7Td9Lobwvb3o6WWqur7wL6lljSGqupSBt/KHXYa8Lbu9duAJyxmTYtpjvOfCFX1tar6dPf6Xxj8UV7DBFz/ec59ItTAv3abh3c/BZzCYPk8GN9rP9e5T4QkRwO/Bry52w6Nr7shbH+zLbU0MX+cOgV8JMmV3UoFk+YeVfW17vX/A+7RsphGzkry2W64cuyG42ZKshZ4IHA5E3b9Z5w7TMi174akrgJuBv4e+DLw7aq6pesytn/7Z557Ve279i/rrv1fJrljuwp79SrgfwA/7LZ/msbX3RCmmR5ZVQ9iMCT7e0ke1bqgVroHB0/M/0vsvB44nsFQxdeAv2haTc+S3Bn4W+D3q2q/herG/frPcu4Tc+2r6taqOonBSi7rgZ9rW9HimXnuSe4L/BGDfwYPAe4G/GG7CvuR5PHAzVV1ZetahhnC9jfKUktjraqmut83A+9l8Adqknw9yT0But83N65nUVXV17s/0j8E3sQYX/8khzMIIX9TVRd2zRNx/Wc790m69vtU1beBS4BfAI7KYPk8mIC//UPnvqEboq6q+h7wVsbz2j8CODXJVxhMNToF2Ebj624I298oSy2NrSR3SnKXfa+BXwGumf9dY2d4Ka3fAt7fsJZFty+AdJ7ImF7/bi7IW4DPV9Urh3aN/fWf69wn6NqvTnJU9/ongMcymBd3CYPl82B8r/1s5/7PQ//HIwzmRI3dta+qP6qqo6tqLYP/tn+sqp5O4+vuw1pn6L6W/Sp+vNTSy9pWtHiS3JvB3S8YrKbwznE+/yTvAk4GVgFfB14MvA+4ADgWuAl4alWN5eT1Oc7/ZAbDUQV8BfivQ3OkxkaSRwL/BHyOH88P+Z8M5kaN9fWf59xPZzKu/f0ZTMA+jMGNiAuqakv39+88BsNxnwGe0d0ZGhvznPvHgNVAgKuA3x2awD92kpwMvKCqHt/6uhvCJEmSGnA4UpIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEk6YEmWzNfX56olyROGFqae7/2/m+Q3D+B4a5P8R5LPJPl8kp1JnjW0/9Qkcy78neSk7lE4kibcioW7SNKy9ATgg8B183WqqjccxGd/uaoeCD96vt6FSVJVb62qHcz/kOeTgHXARQdxXEljxDthkg6JJMcn+btu8fd/SvJzSe6a5KYkd+j63CnJV5McPlv/rs+5SV6d5J
NJbkjy5K79nkkuTXJVkmuS/OLQsV+W5OoklyW5R5KHA6cCr+j6Hz/P8V6S5AXd648neXl3d+uLw8eYS1XdADwP+O/dZzwryV91r5/S1Xp1V/sRwBZgY1fXxiTrk3yqu7P2ySQ/O/Q5F3Y1fynJnw2d74Ykn+4+96ND/2zP6Wr/TJLTbu81ldQv74RJOlS2M3jS9peSPBR4XVWdkuQq4JcYLA/yeODiqvpBktv0Z7CeG8A9gUcyWFR4B/Ae4L90731ZksOAn+z63gm4rKpe1AWV51TV/0qyA/hgVb0HoAsrcx1v2IqqWt8NGb4YeMwI5/5pZl8EejPwn6tqKslRVfX9JJuBdVV1VlfXSuAXq+qWJI8B/hR4Uvf+k4AHAt8DvpDkNcB3Gazt+KiqujHJ3bq+L2KwFMvvZLA0zc4k/1BV/zZC/ZIaMIRJut2S3Bl4OPDuwfJzANyx+30+sJFBCHsa8LoF+gO8r1tI+rok9+jargDOyWDx6fdV1VVd+/cZDDsCXMlgPbwDqW+mfYt5XwmsnfOkZxxijvZPAOcmuWDoc2e6K/C2JCcwWDLo8KF9H62q7wAkuQ64F/BTwKVVdSPA0LJKv8JggeIXdNtHMlh+6fMjnoOkRWYIk3Qo3AH4dlWdNMu+HcCfdndsHgx8jMHdq7n6w+DOzz4BqKpLkzwK+DUGweaVVfV24Af14/XXbmX2v2vz1TfXsef6rNk8kFnCTlX9bnfX7deAK5M8eJb3vhS4pKqemGQt8PFZahmlngBPqqovjFizpMacEybpdquqaeDGJE8ByMADun3/yuAu1jYGw4O3ztd/LknuBXy9qt4EvBl40AJl/Qtwl4Xqu7264PTnwGtm2Xd8VV1eVZuBPcAxw3V17gpMda+fNcIhLwMeleS47hj7hiMvBp6b7lZfkgce8MlIWlSGMEkH4yeT7B76eR7wdOCMJFcD1wLDE8PPB57R/d5nvv6zORm4OslnGAxvblug/3nAC7tJ6scfxPHmc3z3uZ8HLgBeXVVvnaXfK5J8Lsk1wCeBqxkMy564b2I+8GfA/+7Oa8E7b1W1BziTwTcyr+bH/0xfymAo87NJru22JS1h+fFdfEmSJC0W74RJkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGvj/J3uw7FpaAhkAAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"plt.figure(figsize=(10, 6))\n",
"ax = sns.histplot(data=str_df[str_df.LevDist <= 40], x=\"LevDist\", bins=100)\n",
"ax.set(xlabel=\"Levenshtein Distance\", ylabel = \"Count (in millions)\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "hundred-bowling",
"metadata": {},
"outputs": [],
"source": [
"# pd.qcut(str_df[str_df.LevDist <= 100]['LevDist'], q=100, retbins=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "quarterly-shock",
"metadata": {},
"outputs": [],
"source": [
"str_df.LevDist[str_df.LevDist <= 20].hist(bins=100).set_title(\"count v/s Lev edit distances till 20\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "entire-candle",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 label \\\n",
" --filter-keys node1 label \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv"
]
},
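{
"cell_type": "markdown",
"id": "sketch-anti-join",
"metadata": {},
"source": [
"The `kgtk ifnotexists` call above keeps the removed statements whose `(node1, label)` pair has no counterpart in the file of updated values. As a rough illustration of that idea only (not of KGTK's implementation), the same anti-join can be sketched in pandas on toy data. The frames `removed` and `updated` and their contents are made up for this example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-ins for the two TSV files (hypothetical values)\n",
"removed = pd.DataFrame({'node1': ['Q1', 'Q1', 'Q2'],\n",
"                        'label': ['P1', 'P2', 'P1'],\n",
"                        'node2': ['a', 'b', 'c']})\n",
"updated = pd.DataFrame({'node1': ['Q1'],\n",
"                        'label': ['P1'],\n",
"                        'node2;newValue': ['a2']})\n",
"\n",
"keys = ['node1', 'label']\n",
"# A left merge with indicator=True marks rows that found no partner as 'left_only'\n",
"merged = removed.merge(updated[keys].drop_duplicates(), on=keys,\n",
"                       how='left', indicator=True)\n",
"not_updated = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')\n",
"print(not_updated)\n",
"```\n",
"\n",
"On files of this size the KGTK command is likely the more practical choice; the pandas form is mainly useful for checking the logic on a small sample."
]
},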
{
"cell_type": "code",
"execution_count": 35,
"id": "similar-nevada",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"16922584 ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv"
]
},
{
"cell_type": "markdown",
"id": "administrative-barbados",
"metadata": {},
"source": [
"### Dates Comparison"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "creative-office",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-03-15 01:44:30 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_22_c1.\"node1\", graph_22_c1.\"label\", graph_22_c1.\"node2\", graph_24_c2.\"label\" \"_aLias.newNode2Label\", graph_24_c2.\"node2\" \"_aLias.newNode2\"\n",
" FROM graph_22 AS graph_22_c1, graph_24 AS graph_24_c2\n",
" WHERE graph_22_c1.\"node1\"=graph_24_c2.\"node1\"\n",
" AND (graph_22_c1.\"label\" = graph_24_c2.\"label\")\n",
" PARAS: []\n",
"---------------------------------------------\n",
"[2021-03-15 01:44:30 sqlstore]: CREATE INDEX on table graph_22 column node1 ...\n",
"[2021-03-15 01:44:33 sqlstore]: ANALYZE INDEX on table graph_22 column node1 ...\n",
"[2021-03-15 01:44:34 sqlstore]: CREATE INDEX on table graph_24 column node1 ...\n",
"[2021-03-15 01:45:08 sqlstore]: ANALYZE INDEX on table graph_24 column node1 ...\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/claims.time.tsv.gz \\\n",
" --match \"newSeg: (x)-[r]->(y), time: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date_new_vals_rightone.tsv \\\n",
" --graph-cache ~/temp1.sqlite3.db\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "identified-calculation",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"date_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_date_new_vals_rightone.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "federal-cursor",
"metadata": {},
"outputs": [],
"source": [
"# date_df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "infinite-handbook",
"metadata": {},
"outputs": [],
"source": [
"date_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "established-joining",
"metadata": {},
"outputs": [],
"source": [
"def parseDate(str):\n",
"# try:\n",
" if str == '' or str == \" \": return []\n",
" elems = []\n",
" toFetchI = 1\n",
" dash1 = str.find(\"-\",toFetchI)\n",
" toFetchI = dash1 + 1\n",
" elems.append(int(str[:dash1]))\n",
"\n",
" dash2 = str.find(\"-\",toFetchI)\n",
" toFetchI = dash2 + 1\n",
" elems.append(int(str[dash1+1:dash2]))\n",
"\n",
" dashT = str.find(\"T\",toFetchI)\n",
" toFetchI = dashT + 1\n",
" elems.append(int(str[dash2+1:dashT]))\n",
"\n",
" dashC = str.find(\":\",toFetchI)\n",
" toFetchI = dashC + 1\n",
" elems.append(int(str[dashT+1:dashC]))\n",
"\n",
" dashC2 = str.find(\":\",toFetchI)\n",
" toFetchI = dashC2 + 1\n",
" elems.append(int(str[dashC+1:dashC2]))\n",
"\n",
" dashZ = str.find(\"Z\",toFetchI)\n",
" toFetchI = dashZ + 2\n",
" elems.append(int(str[dashC2+1:dashZ]))\n",
"\n",
" elems.append(int(str[toFetchI:]))\n",
" return elems\n",
"# except:\n",
"# print(str)\n",
"# return []\n",
" "
]
},
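{
"cell_type": "markdown",
"id": "sketch-regex-time",
"metadata": {},
"source": [
"The position-based parsing in `parseDate` can also be expressed as a single regular expression. The following is a minimal sketch under the assumption that every value has the shape `year-MM-DDThh:mm:ssZ/precision`, optionally preceded by `^`. The names `TIME_PATTERN` and `parse_time_literal` are hypothetical, and unlike `parseDate` the sketch returns an empty list for malformed input instead of raising.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Optional sign and year, fixed two-digit fields, then the precision suffix\n",
"TIME_PATTERN = re.compile(\n",
"    '([+-]?[0-9]+)-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})Z/([0-9]+)'\n",
")\n",
"\n",
"def parse_time_literal(value):\n",
"    # Drop a leading '^' marker if present, then require a full match\n",
"    match = TIME_PATTERN.fullmatch(value.lstrip('^'))\n",
"    if match is None:\n",
"        return []\n",
"    return [int(group) for group in match.groups()]\n",
"\n",
"print(parse_time_literal('^1887-00-00T00:00:00Z/9'))\n",
"```\n",
"\n",
"For the example value this prints `[1887, 0, 0, 0, 0, 0, 9]`, the same list layout that `validateDate` expects."
]
},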
{
"cell_type": "code",
"execution_count": null,
"id": "lucky-gossip",
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"def validateDate(elems):\n",
" if len(elems) == 0:\n",
" return False\n",
" precision = elems[-1]\n",
"# assert precision >= 9\n",
" elems = elems[:-1]\n",
" if elems[1] == 0: elems[1] = 1\n",
" if elems[2] == 0: elems[2] = 1\n",
" \n",
" if elems[0] < 1970 or elems[0] > 9999: \n",
" if elems[0] % 400 == 0 or (elems[0] % 4 == 0 and elems[0] % 100 != 0):\n",
" elems[0] = 1972\n",
" else:\n",
" elems[0] = 1970\n",
" if precision < 0 or precision > 14:\n",
" return False\n",
" try:\n",
" datetime.datetime(*elems)\n",
" return True\n",
" except:\n",
" return False\n",
" return status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "executed-theater",
"metadata": {},
"outputs": [],
"source": [
"validateDate(parseDate(\"1887-00-00T00:00:00Z/9\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "enormous-carpet",
"metadata": {},
"outputs": [],
"source": [
"datetime.datetime(*[1948, 2, 29, 0, 0, 0, 11])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "complete-index",
"metadata": {},
"outputs": [],
"source": [
"date_df['parsed_date'] = date_df['node2'].apply(lambda x: parseDate(x[1:]))\n",
"date_df['parsed_date2'] = date_df['newNode2'].apply(lambda x: parseDate(x[1:]))\n",
"date_df['valid_date'] = date_df['node2'].apply(lambda x: validateDate(parseDate(x[1:])))\n",
"date_df['same_date'] = date_df.apply(lambda p: p.parsed_date == p.parsed_date2, axis=1)\n",
"date_df['str_same_date'] = date_df.apply(lambda p: p.node2 == p.newNode2, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "surface-warehouse",
"metadata": {},
"outputs": [],
"source": [
"len(date_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "diagnostic-satellite",
"metadata": {},
"outputs": [],
"source": [
"date_df[date_df['valid_date'] == False]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "seventh-sister",
"metadata": {},
"outputs": [],
"source": [
"date_df[date_df['same_date']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "failing-mileage",
"metadata": {},
"outputs": [],
"source": [
"print(f\"Number of removed statements whose date is identical to the value in the dataset as of 7 December 2020: {date_df['str_same_date'].sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "clean-canon",
"metadata": {},
"outputs": [],
"source": [
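"import datetime\n",
"\n",
"def customTimeDelta(date1, date2):\n",
"    \"\"\"Return date1 - date2 as a timedelta, or None if either date cannot be built.\"\"\"\n",
"    # Check for missing values before indexing into them.\n",
"    if not date1 or not date2:\n",
"        return None\n",
"    # sys.maxint does not exist in Python 3, so bound the year by what datetime supports.\n",
"    if not (datetime.MINYEAR <= date1[0] <= datetime.MAXYEAR):\n",
"        return None\n",
"    if not (datetime.MINYEAR <= date2[0] <= datetime.MAXYEAR):\n",
"        return None\n",
"    try:\n",
"        # Drop the trailing precision element before building each datetime.\n",
"        return datetime.datetime(*date1[:-1]) - datetime.datetime(*date2[:-1])\n",
"    except (OverflowError, TypeError, ValueError):\n",
"        return None"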
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "waiting-thumbnail",
"metadata": {},
"outputs": [],
"source": [
"date_df1 = date_df[date_df['valid_date'] & ~date_df['same_date']].copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "superior-gothic",
"metadata": {},
"outputs": [],
"source": [
"date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "muslim-stephen",
"metadata": {},
"outputs": [],
"source": [
"date_df1['time_delta']"
]
},
{
"cell_type": "markdown",
"id": "relative-tomorrow",
"metadata": {},
"source": [
"### Numeric Values Comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "revolutionary-mistake",
"metadata": {},
"outputs": [],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n",
" --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "eight-haven",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4323460 ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "unknown-nirvana",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-09 15:19:10 sqlstore]: IMPORT graph directly into table graph_71 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_qty.tsv ...\n",
"[2021-04-09 15:19:30 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_71_c1.\"node1\", graph_71_c1.\"label\", graph_71_c1.\"node2\", graph_51_c2.\"label\" \"_aLias.node2;newLabel\", graph_51_c2.\"node2\" \"_aLias.node2;newVal\"\n",
" FROM graph_51 AS graph_51_c2, graph_71 AS graph_71_c1\n",
" WHERE graph_51_c2.\"node1\"=graph_71_c1.\"node1\"\n",
" AND (graph_71_c1.\"label\" = graph_51_c2.\"label\")\n",
" PARAS: []\n",
"---------------------------------------------\n",
"[2021-04-09 15:19:30 sqlstore]: CREATE INDEX on table graph_71 column node1 ...\n",
"[2021-04-09 15:19:32 sqlstore]: ANALYZE INDEX on table graph_71 column node1 ...\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv \\\n",
" ../../gdrive-kgtk-dump-2020-12-07/claims.quantity.tsv.gz \\\n",
" --match \"non: (x)-[r]->(y), quantity: (x)-[s]->(z)\" \\\n",
" --where \"r.label = s.label\" \\\n",
" --return 'x, r.label, y, s.label as `node2;newLabel`, z as `node2;newVal`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "convertible-softball",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3239699 ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "unlikely-overhead",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "historical-copying",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2021-04-09 15:26:38 sqlstore]: IMPORT graph directly into table graph_72 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv ...\n",
"[2021-04-09 15:29:43 query]: SQL Translation:\n",
"---------------------------------------------\n",
" SELECT graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", graph_72_c1.\"node2;newLabel\" \"_aLias.node2;newLabel\", max(graph_72_c1.\"node2;newVal\") \"_aLias.node2;newValue\", count(graph_72_c1.\"node2;newVal\") \"_aLias.node2;branching\"\n",
" FROM graph_72 AS graph_72_c1\n",
" WHERE graph_72_c1.\"node2;newLabel\"=graph_72_c1.\"node2;newLabel\"\n",
" AND graph_72_c1.\"node2;newVal\"=graph_72_c1.\"node2;newVal\"\n",
" GROUP BY graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", \"_aLias.node2;newLabel\"\n",
" PARAS: []\n",
"---------------------------------------------\n"
]
}
],
"source": [
"!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv \\\n",
" --match \"(node1)-[r]->(node2{newLabel: newLabel, newVal: newValue})\" \\\n",
" --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "waiting-citizenship",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "unlike-huntington",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>node2;newLabel</th><th>node2;newValue</th><th>node2;branching</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2501639</th><td>Q999961</td><td>P1082</td><td>+17243[+17243,+17243]</td><td>P1082</td><td>+8883</td><td>27</td></tr>\n",
"    <tr><th>2501640</th><td>Q999961</td><td>P1082</td><td>+6925</td><td>P1082</td><td>+8883</td><td>27</td></tr>\n",
"    <tr><th>2501641</th><td>Q999961</td><td>P1082</td><td>+8653[+8653,+8653]</td><td>P1082</td><td>+8883</td><td>27</td></tr>\n",
"    <tr><th>2501642</th><td>Q999961</td><td>P2046</td><td>+23.95Q712226</td><td>P2046</td><td>+23.952616Q712226</td><td>1</td></tr>\n",
"    <tr><th>2501643</th><td>Q999988</td><td>P2046</td><td>+1000[+1000,+1000]Q81292</td><td>P2046</td><td>+1000Q81292</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node1 label node2 node2;newLabel \\\n",
"2501639 Q999961 P1082 +17243[+17243,+17243] P1082 \n",
"2501640 Q999961 P1082 +6925 P1082 \n",
"2501641 Q999961 P1082 +8653[+8653,+8653] P1082 \n",
"2501642 Q999961 P2046 +23.95Q712226 P2046 \n",
"2501643 Q999988 P2046 +1000[+1000,+1000]Q81292 P2046 \n",
"\n",
" node2;newValue node2;branching \n",
"2501639 +8883 27 \n",
"2501640 +8883 27 \n",
"2501641 +8883 27 \n",
"2501642 +23.952616Q712226 1 \n",
"2501643 +1000Q81292 1 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.tail()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "confident-carolina",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n",
"P1733\tP4876\t+1014280\tP4876\t+28977\t1\r\n",
"P2040\tP4876\t+34596\tP4876\t+38623\t1\r\n",
"P2349\tP4876\t+12367\tP4876\t+12500\t3\r\n",
"P2427\tP4876\t+95000\tP4876\t+96793\t4\r\n",
"P2518\tP4876\t+11126\tP4876\t+11145\t1\r\n",
"P2725\tP4876\t+2232\tP4876\t+3907\t1\r\n",
"P2816\tP4876\t+32155\tP4876\t+34149\t2\r\n",
"P3289\tP4876\t+113576\tP4876\t+123199\t1\r\n",
"P3400\tP4876\t+123817\tP4876\t+123817\t4\r\n"
]
}
],
"source": [
"!head ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "adjusted-discretion",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7\n"
]
}
],
"source": [
"import re\n",
"test_str = \"+123817Q\"\n",
"temp = re.search(r'[a-z]', test_str, re.I)\n",
"if temp is not None:\n",
" print(temp.start())\n",
"else:\n",
" print(\"Not found\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "toxic-heart",
"metadata": {},
"outputs": [],
"source": [
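"def splitIntoParts(text):\n",
"    \"\"\"Split a quantity string into (number, bracketed range, unit).\n",
"\n",
"    For example, '+1234[+1, -1]Q12345' gives ('+1234', '[+1, -1]', 'Q12345').\n",
"    \"\"\"\n",
"    # The unit is taken to start at the first letter. Note that a letter inside\n",
"    # the number itself (such as an exponent marker) would also match here.\n",
"    temp = re.search(r'[a-z]', text, re.I)\n",
"    firstAlpha1 = -1 if temp is None else temp.start()\n",
"    alpha1 = \"\" if firstAlpha1 == -1 else text[firstAlpha1:]\n",
"    text = text if firstAlpha1 == -1 else text[:firstAlpha1]\n",
"\n",
"    # The range, if present, starts at the first '[' of what remains.\n",
"    temp = re.search(r'\\[', text)\n",
"    firstBracket1 = -1 if temp is None else temp.start()\n",
"    brack1 = \"\" if firstBracket1 == -1 else text[firstBracket1:]\n",
"\n",
"    # Everything before the range is the number.\n",
"    num1 = text if firstBracket1 == -1 else text[:firstBracket1]\n",
"\n",
"    return num1, brack1, alpha1"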
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "impressed-monthly",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('+1234', '[+1, -1]', 'Q12345')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"splitIntoParts(\"+1234[+1, -1]Q12345\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "sunset-fraction",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c86b1765daec4bc084f0c0f399a69dfd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"  0%|          | 0/2501645 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0mval1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0mval2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"import re\n",
"import rltk\n",
"from rltk.similarity import levenshtein_distance as ld\n",
"from nltk.tokenize import word_tokenize as wt\n",
"from tqdm.notebook import tqdm\n",
"\n",
"def is_num(string):\n",
"    try:\n",
"        float(string)\n",
"        return True\n",
"    except ValueError:\n",
"        return False\n",
"\n",
"in_path = \"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\"\n",
"out_path = \"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\"\n",
"\n",
"with open(in_path, \"r\") as fIn:\n",
"    f1 = fIn.read().split(\"\\n\")\n",
"\n",
"# The context manager closes the output file even if a row raises an error.\n",
"with open(out_path, \"w\") as fNum:\n",
"    fNum.write(f1[0] + \"\\tNumNE\\tRangeNE\\tNumNRangeNE\\tUnitNE\\n\")\n",
"\n",
"    for line in tqdm(f1[1:]):\n",
"        fields = line.split(\"\\t\")\n",
"        # Skip blank or short rows (such as the empty string left by a trailing\n",
"        # newline), which previously raised IndexError on fields[2].\n",
"        if len(fields) < 5:\n",
"            continue\n",
"        val1, val2 = fields[2], fields[4]\n",
"\n",
"        num1, brack1, alpha1 = splitIntoParts(val1)\n",
"        num2, brack2, alpha2 = splitIntoParts(val2)\n",
"\n",
"        # Flags: number differs, range differs, both differ, unit differs.\n",
"        flags = [num1 != num2, brack1 != brack2, (num1 != num2) and (brack1 != brack2), alpha1 != alpha2]\n",
"        fNum.write(line + \"\\t\" + \"\\t\".join(str(flag) for flag in flags) + \"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "impaired-venue",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"num_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",sep='\\t')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "strange-alcohol",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>node2;newLabel</th><th>node2;newValue</th><th>node2;branching</th><th>NumNE</th><th>RangeNE</th><th>NumNRangeNE</th><th>UnitNE</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>P1733</td><td>P4876</td><td>+1014280</td><td>P4876</td><td>+28977</td><td>1</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>1</th><td>P2040</td><td>P4876</td><td>+34596</td><td>P4876</td><td>+38623</td><td>1</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>2</th><td>P2349</td><td>P4876</td><td>+12367</td><td>P4876</td><td>+12500</td><td>3</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>3</th><td>P2427</td><td>P4876</td><td>+95000</td><td>P4876</td><td>+96793</td><td>4</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>4</th><td>P2518</td><td>P4876</td><td>+11126</td><td>P4876</td><td>+11145</td><td>1</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" node1 label node2 node2;newLabel node2;newValue node2;branching \\\n",
"0 P1733 P4876 +1014280 P4876 +28977 1 \n",
"1 P2040 P4876 +34596 P4876 +38623 1 \n",
"2 P2349 P4876 +12367 P4876 +12500 3 \n",
"3 P2427 P4876 +95000 P4876 +96793 4 \n",
"4 P2518 P4876 +11126 P4876 +11145 1 \n",
"\n",
" NumNE RangeNE NumNRangeNE UnitNE \n",
"0 True False False False \n",
"1 True False False False \n",
"2 True False False False \n",
"3 True False False False \n",
"4 True False False False "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "hindu-merit",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "hollywood-boring",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 2.501575e+06\n",
"mean 6.733284e+01\n",
"std 5.003042e+02\n",
"min 1.000000e+00\n",
"25% 1.000000e+00\n",
"50% 2.000000e+00\n",
"75% 1.100000e+01\n",
"max 2.132100e+04\n",
"Name: node2;branching, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"num_df['node2;branching'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "moral-history",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 2501575 quantities, the number changed in 1496454 cases, the range changed in 2037283 cases, both number and range changed in 1069289 cases, and the unit changed in 78048 cases\n"
]
}
],
"source": [
"print(f\"Out of {len(num_df)} quantities, the number changed in {num_df['NumNE'].sum()} cases, the range changed in {num_df['RangeNE'].sum()} cases, both number and range changed in {num_df['NumNRangeNE'].sum()} cases, and the unit changed in {num_df['UnitNE'].sum()} cases\")"
]
},
{
"cell_type": "markdown",
"id": "muslim-dryer",
"metadata": {},
"source": [
"### Qnodes comparison"
]
},
{
"cell_type": "markdown",
"id": "brilliant-picnic",
"metadata": {},
"source": [
"#### Qnodes type segregation\n",
"\n",
"Here, for each removed Qnode-to-Qnode statement, we count:\n",
"* how many statements have a node1 that is an instance (P31) of some other item, a subclass (P279) of some other item, or both\n",
"* how many statements have a node2 that is an instance (P31) of some other item, a subclass (P279) of some other item, or both"
]
},
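{
"cell_type": "markdown",
"id": "sketch-ifexists-membership",
"metadata": {},
"source": [
"The `kgtk ifexists` calls in the following cells keep a removed statement when its `node1` (or `node2`) occurs as `node1` in the P31 or P279 file. As a rough illustration of that membership test, here is a minimal pure-Python sketch on toy data. The helper name `keep_if_exists` and the toy rows are assumptions made for this example and are not part of the pipeline:\n",
"\n",
"```python\n",
"def keep_if_exists(statements, reference, input_key):\n",
"    \"\"\"Keep statements whose input_key value occurs as node1 in reference.\"\"\"\n",
"    known = {ref[\"node1\"] for ref in reference}\n",
"    return [s for s in statements if s[input_key] in known]\n",
"\n",
"# Toy data (hypothetical), standing in for the removed-statements and P31 files.\n",
"removed = [\n",
"    {\"node1\": \"Q1\", \"label\": \"P17\", \"node2\": \"Q10\"},\n",
"    {\"node1\": \"Q2\", \"label\": \"P17\", \"node2\": \"Q11\"},\n",
"    {\"node1\": \"Q3\", \"label\": \"P50\", \"node2\": \"Q12\"},\n",
"]\n",
"p31 = [\n",
"    {\"node1\": \"Q1\", \"label\": \"P31\", \"node2\": \"Q5\"},\n",
"    {\"node1\": \"Q12\", \"label\": \"P31\", \"node2\": \"Q5\"},\n",
"]\n",
"\n",
"node1_typed = keep_if_exists(removed, p31, \"node1\")  # statements whose node1 has a P31\n",
"node2_typed = keep_if_exists(removed, p31, \"node2\")  # statements whose node2 has a P31\n",
"print(len(node1_typed), len(node2_typed))\n",
"```\n",
"\n",
"Chaining two such filters (first against P279, then against P31) gives the statements whose node is both an instance and a subclass, which is how the `P31andP279` files are produced."
]
},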
{
"cell_type": "code",
"execution_count": null,
"id": "described-america",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "universal-surprise",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "elder-tissue",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "killing-emphasis",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "answering-sheriff",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "intimate-sullivan",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node2 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "surprising-clone",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"15682364 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "innovative-thread",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 3500869 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv\n",
" 3396316 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv\n",
" 14206459 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv\n",
" 21103644 total\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1*"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "accompanied-lighting",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 10064419 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv\n",
" 6622159 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv\n",
" 12057758 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv\n",
" 28744336 total\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2*"
]
},
{
"cell_type": "markdown",
"id": "verified-vegetable",
"metadata": {},
"source": [
"#### Qnodes to Qnodes (instance/subclass analysis)\n",
"\n",
"Here, we count how many P31 (instance of) statements were removed and, for each one, whether its node1 now has a P31 statement, a P279 (subclass of) statement, both, or neither. We then repeat the same analysis for the removed P279 statements."
]
},
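{
"cell_type": "markdown",
"id": "sketch-type-partition",
"metadata": {},
"source": [
"The four outcomes described above (new P31, new P279, both, nothing) overlap, so their counts are related by inclusion-exclusion. The following is a minimal pure-Python sketch of that bookkeeping on toy data. The function name `classify_removed` and the toy sets are assumptions made for this example only:\n",
"\n",
"```python\n",
"def classify_removed(removed_node1s, has_p31, has_p279):\n",
"    \"\"\"Count removed statements by which type statements their node1 still has.\"\"\"\n",
"    counts = {\"newP31\": 0, \"newP279\": 0, \"newP31andP279\": 0, \"nothingnew\": 0}\n",
"    for node in removed_node1s:\n",
"        in_p31 = node in has_p31\n",
"        in_p279 = node in has_p279\n",
"        counts[\"newP31\"] += in_p31\n",
"        counts[\"newP279\"] += in_p279\n",
"        counts[\"newP31andP279\"] += in_p31 and in_p279\n",
"        counts[\"nothingnew\"] += not (in_p31 or in_p279)\n",
"    return counts\n",
"\n",
"# Toy data (hypothetical): node1 of four removed P31 statements.\n",
"removed_node1s = [\"Q1\", \"Q2\", \"Q3\", \"Q4\"]\n",
"has_p31 = {\"Q1\", \"Q2\"}    # nodes that currently have some P31 statement\n",
"has_p279 = {\"Q2\", \"Q3\"}   # nodes that currently have some P279 statement\n",
"\n",
"counts = classify_removed(removed_node1s, has_p31, has_p279)\n",
"# Inclusion-exclusion: the four counts must add back up to the total.\n",
"total = counts[\"newP31\"] + counts[\"newP279\"] - counts[\"newP31andP279\"] + counts[\"nothingnew\"]\n",
"print(counts, total)\n",
"```\n",
"\n",
"The same identity can serve as a sanity check on the `wc -l` counts reported in the cells that follow, after subtracting the header line from each file."
]
},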
{
"cell_type": "code",
"execution_count": null,
"id": "quick-welsh",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --return 'count(a)' \\\n",
" --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P31.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "satisfied-philosophy",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P31]->(b)' \\\n",
" --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "southern-daisy",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P279]->(b)' \\\n",
" --return 'count(a)' \\\n",
" --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "subtle-tract",
"metadata": {},
"outputs": [],
"source": [
"!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n",
" --match 'o: (a)-[:P279]->(b)' \\\n",
" --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv"
]
},
{
"cell_type": "markdown",
"id": "opponent-bible",
"metadata": {},
"source": [
"##### Analyze for P31 relations"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "soviet-liverpool",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "imposed-pound",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "provincial-limit",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dynamic-persian",
"metadata": {},
"outputs": [],
"source": [
"!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n",
" ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "material-routine",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "aboriginal-injection",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3611396 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n",
"2864334 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n",
"150123 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n",
"106540 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n",
"703480 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "perceived-hopkins",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n",
" --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "antique-neighborhood",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "alleged-destiny",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 626925 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv\r\n",
" 76556 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv\r\n",
" 703481 total\r\n"
]
}
],
"source": [
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_*"
]
},
{
"cell_type": "markdown",
"id": "opposed-palmer",
"metadata": {},
"source": [
"##### Analyze for P279 relations"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "hybrid-hacker",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "reliable-ontario",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "radio-bumper",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n",
" --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "loving-switzerland",
"metadata": {},
"outputs": [],
"source": [
"!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n",
" ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "prostate-trace",
"metadata": {},
"outputs": [],
"source": [
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n",
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "subsequent-recovery",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"935667 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n",
"865917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n",
"454917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n",
"421734 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n",
"36568 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv\n"
]
}
],
"source": [
"# Line counts per file; each count includes the one-line TSV header.\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "hazardous-liberal",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"^C\r\n",
"\r\n",
"Keyboard interrupt in ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz --filter-mode NONE --input-keys node1 --filter-keys node1 -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv.\r\n"
]
}
],
"source": [
"# Keep _nothingnew rows whose node1 does not appear as node1 in the 2020-12-07 claims dump.\n",
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n",
" --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n",
" --filter-mode NONE \\\n",
" --input-keys node1 \\\n",
" --filter-keys node1 \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "manual-embassy",
"metadata": {},
"outputs": [],
"source": [
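"# Keep the remaining _nothingnew rows, i.e. those not written to _deleted (no key options are given, so KGTK defaults apply).\n",
"!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n",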
" --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv \\\n",
" -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "determined-wonder",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 35004 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv\r\n",
" 1565 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv\r\n",
" 36569 total\r\n"
]
}
],
"source": [
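"# Each file carries its own header line, so the total is one more than the _nothingnew line count (36568).\n",
"!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_*"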
]
},
{
"cell_type": "markdown",
"id": "dramatic-spyware",
"metadata": {},
"source": [
"Fin."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "kgtkEnv",
"language": "python",
"name": "kgtkenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "288px"
},
"toc_section_display": true,
"toc_window_display": true
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}