{ "cells": [ { "cell_type": "markdown", "id": "statutory-onion", "metadata": {}, "source": [ "# Understanding the Removed Statements Dataset\n", "\n", "Source of data: [GDrive | Removed Statements of Wikidata | Feb 1 2021](https://drive.google.com/file/d/1TQP1rADdvhDjsvBpLzSE9Bx3n73wf-Md/view?usp=sharing)\n", "\n", "Steps performed:\n", "* Divide the dataset into two parts: redirected and non-redirected. The redirected part contains the statements in which either node1 or node2 is a redirect; the non-redirected part contains the statements in which neither node1 nor node2 is a redirect.\n", "\n", "\n", "**Summary**\n", "\n", "The Removed Statements dataset has 76.5M removed statements. Out of these, " ] }, { "cell_type": "markdown", "id": "christian-mounting", "metadata": {}, "source": [ "## Determining redirects and dividing the dataset into two parts\n", "\n", "* Since no redirects dataset was available, a SPARQL query was run on Feb 19, 2021 to determine all the redirects existing at that time. It was executed using the [Wikidata Query Service](https://query.wikidata.org/). The query was:\n", " ```\n", " SELECT ?old_node\n", " WHERE {\n", " ?old_node owl:sameAs ?new_node.\n", " }\n", " ```\n", "* The result also contains a few lexemes, which we don't need. To identify them, I then ran the query:\n", " ```\n", " SELECT ?old_node\n", " WHERE {\n", " ?old_node owl:sameAs ?new_node.\n", " ?new_node rdf:type ontolex:LexicalEntry.\n", " }\n", " ```\n", "* After removing the lexemes from the nodes file, a final file of redirected non-lexemes was created with data from Feb 19, 2021: `data/SPARQL_redirects_non-lexemes.tsv`.\n", "* Using this reduced dataset, I was able to determine which nodes in the `removed_statements.tsv` dataset have been redirected - `../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`. 
This contains the removed statements in which either node1 or node2 is redirected.\n", "* After this, I extract the removed statements that are not present in this subset, i.e. all removed statements in which neither node1 nor node2 is redirected - `../opAnalysis/removed_statements_both_nonredirects.tsv`\n", "\n", "For this, I am using the following set of commands:" ] }, { "cell_type": "code", "execution_count": 2, "id": "thick-absorption", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "id": "boolean-string", "metadata": {}, "outputs": [], "source": [ "# Split the removed statements on the basis of the SPARQL redirects list\n", "# Statements whose node1 is / is not in the redirects list\n", "!kgtk ifexists -i ../../data/removed_statements.tsv \\\n", " --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys id \\\n", " -o ../../opAnalysis/removed_statements_redirects_basis_node1.tsv\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", " --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys id \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_basis_node1.tsv\n", "# Statements whose node2 is / is not in the redirects list\n", "!kgtk ifexists -i ../../data/removed_statements.tsv \\\n", " --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys id \\\n", " -o ../../opAnalysis/removed_statements_redirects_basis_node2.tsv\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", " --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys id \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_basis_node2.tsv\n", "# Combine both redirected subsets: keep the node1 rows that are not already in the node2 subset (temp1), then append the node2 subset\n", "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n", " -o 
../../opAnalysis/temp1.tsv\n", "!kgtk cat -i ../../opAnalysis/temp1.tsv \\\n", " ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n", " -o ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects.tsv" ] }, { "cell_type": "markdown", "id": "committed-volunteer", "metadata": {}, "source": [ "## P31 edges distribution" ] }, { "cell_type": "markdown", "id": "objective-range", "metadata": {}, "source": [ "Next, for the redirected dataset (`../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`), we determine how many of the removed statements are P31 (instance of) edges and how they are distributed over the target classes." ] }, { "cell_type": "markdown", "id": "final-fraud", "metadata": {}, "source": [ "### For Redirected Removed Statements" ] }, { "cell_type": "code", "execution_count": null, "id": "analyzed-silicon", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --return 'b, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_redirects_P31_stats1.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "smaller-eugene", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count perc\n", "parent \n", "Q4167836 526207 0.213808\n", "Q17329259 301359 0.122448\n", "Q5 222809 0.090531\n", "Q4167410 108583 0.044119\n", "Q13442814 101156 0.041102\n", "Q7187 88231 0.035850\n", "Q11266439 61007 0.024788\n", "Q4423781 53671 0.021808\n", "Q17143521 51581 0.020958\n", "Q15917122 50642 0.020577\n", "Q486972 49257 0.020014\n", "Q16521 46522 0.018903\n", "Q318 26722 0.010858\n", "Q532 23721 0.009638\n", "Q20900710 23482 0.009541" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_P31_stats1.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "japanese-upgrade", "metadata": {}, "source": [ "Find unique list of redirected nodes" ] }, { "cell_type": "code", "execution_count": 9, "id": "former-hudson", "metadata": {}, "outputs": [], "source": [ "!kgtk unique -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv --column node1 -o ../../opAnalysis/removed_statements_redirects_basis_node1_nodes_only.tsv" ] }, { "cell_type": "code", "execution_count": 20, "id": "circular-heritage", "metadata": {}, "outputs": [], "source": [ "!kgtk unique -i ../../opAnalysis/removed_statements_redirects_basis_node2.tsv --column node2 -o ../../opAnalysis/removed_statements_redirects_basis_node2_nodes_only.tsv" ] }, { "cell_type": "code", "execution_count": 21, "id": "irish-envelope", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i ../../opAnalysis/removed_statements_redirects_basis_node1_nodes_only.tsv \\\n", " ../../opAnalysis/removed_statements_redirects_basis_node2_nodes_only.tsv \\\n", " -o ../../opAnalysis/removed_statements_redirects_nodes_only.tsv" ] }, { "cell_type": "code", "execution_count": 22, "id": 
"bridal-effort", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_redirects_nodes_only.tsv \\\n", " --match '(node1)-[label]->(node2)' \\\n", " --return 'node1, label.label, sum(node2)' \\\n", " -o ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv" ] }, { "cell_type": "code", "execution_count": 23, "id": "accomplished-wallpaper", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1864249 ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_redirects_nodes_only_unique.tsv" ] }, { "cell_type": "markdown", "id": "suburban-cosmetic", "metadata": {}, "source": [ "### For non-redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "characteristic-still", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --return 'b, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv" ] }, { "cell_type": "code", "execution_count": 9, "id": "subsequent-dutch", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count perc\n", "parent \n", "Q4167836 368888 0.102453\n", "Q4167410 132403 0.036773\n", "Q5 130252 0.036176\n", "Q571 126883 0.035240\n", "Q11266439 125824 0.034946\n", "Q838948 119928 0.033308\n", "Q486972 108105 0.030025\n", "Q532 106786 0.029658\n", "Q783794 101121 0.028085\n", "Q1539532 78186 0.021715\n", "Q916333 62789 0.017439\n", "Q16521 53402 0.014832\n", "Q7366 45005 0.012499\n", "Q13406463 42582 0.011827\n", "Q18593264 40505 0.011250" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "whole-influence", "metadata": {}, "source": [ "## Properties Distribution" ] }, { "cell_type": "markdown", "id": "international-conditioning", "metadata": {}, "source": [ "### For redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "known-moore", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", " --match 'o: (a)-[r]->(b)' \\\n", " --return 'r.label, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_redirects_props_dist.tsv" ] }, { "cell_type": "code", "execution_count": 6, "id": "unlikely-default", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count perc\n", "parent \n", "P31 2381072 0.234921\n", "P17 357286 0.035251\n", "P1433 299464 0.029546\n", "P735 295778 0.029182\n", "P50 268412 0.026482\n", "P2860 243607 0.024035\n", "P625 227779 0.022473\n", "P106 185184 0.018271\n", "P131 183759 0.018130\n", "P21 179069 0.017667\n", "P921 167723 0.016548\n", "P279 162394 0.016022\n", "P1566 160213 0.015807\n", "P684 152695 0.015065\n", "P703 119182 0.011759" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_props_dist.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "satisfactory-future", "metadata": {}, "source": [ "### For non-redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "seasonal-composite", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " --match 'o: (a)-[r]->(b)' \\\n", " --return 'r.label, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_props_dist.tsv" ] }, { "cell_type": "code", "execution_count": 11, "id": "straight-haiti", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " count perc\n", "parent \n", "P2093 6173393 0.161314\n", "P1476 4238487 0.110754\n", "P31 3327644 0.086953\n", "P569 2011539 0.052563\n", "P625 1494410 0.039050\n", "P577 1116328 0.029170\n", "P234 999522 0.026118\n", "P570 983201 0.025692\n", "P131 927413 0.024234\n", "P364 870224 0.022739\n", "P2044 780870 0.020405\n", "P279 765112 0.019993\n", "P969 732461 0.019140\n", "P356 413439 0.010803\n", "P637 387091 0.010115" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_props_dist.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "martial-friday", "metadata": {}, "source": [ "# Comparing the removed non-redirected (NR) dataset with Q-nodes and literals\n", "\n", "First, let's split this dataset by the datatype of each statement's property (wikibase-item, quantity, string, time)." ] }, { "cell_type": "code", "execution_count": null, "id": "engaging-salon", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:wikibase\-item)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qnode.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "closed-toyota", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qty.tsv\n", "!kgtk --debug query -i 
../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:string)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str.tsv\n", "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:`wikibase-item`)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_qnode.tsv\n", "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:time)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date.tsv\n" ] }, { "cell_type": "markdown", "id": "rough-emerald", "metadata": {}, "source": [ "### String Comparison" ] }, { "cell_type": "code", "execution_count": 3, "id": "amateur-effort", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "^C\r\n" ] } ], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/claims.string.tsv.gz \\\n", " --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n", " --where \"r.label = s.label\" \\\n", " --return 'x as `node1`, r.label as `label`, y as `node2`, s.label as `node2;newLabl`, z as `node2;nw`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n", " --graph-cache ~/temp2.sqlite3.db" ] }, { "cell_type": "code", "execution_count": null, "id": "separate-georgia", "metadata": {}, "outputs": [], "source": [ "# !sed -i '1s/.*/node1\\tlabel\\tnode2\\tnode2;newLabl\\tnode2;nw/' 
removed_statements_both_nonredirects_newSeg_str_new_vals.tsv" ] }, { "cell_type": "markdown", "id": "disturbed-geology", "metadata": {}, "source": [ "The strings subset has a branching factor of approximately 10, i.e. one removed statement with a string literal has been replaced by around 10 new statements (with the same node1-label combination). Comparing the removed value against each of them individually would not give us much insight. Instead, let's truncate this dataset, keeping one row per removed statement with a single new value (the maximum) and the branching count for its node1-label combination." ] }, { "cell_type": "code", "execution_count": null, "id": "downtown-alabama", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-12 08:48:21 sqlstore]: IMPORT graph directly into table graph_1 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv ...\n", "[2021-04-12 09:25:32 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\", graph_1_c1.\"node2;newLabl\" \"_aLias.node2;newLabel\", max(graph_1_c1.\"node2;nw\") \"_aLias.node2;newValue\", count(graph_1_c1.\"node2;nw\") \"_aLias.node2;branching\"\n", " FROM graph_1 AS graph_1_c1\n", " WHERE graph_1_c1.\"node2;newLabl\"=graph_1_c1.\"node2;newLabl\"\n", " AND graph_1_c1.\"node2;nw\"=graph_1_c1.\"node2;nw\"\n", " GROUP BY graph_1_c1.\"node1\", graph_1_c1.\"label\", graph_1_c1.\"node2\", \"_aLias.node2;newLabel\"\n", " PARAS: []\n", "---------------------------------------------\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n", " --match \"(node1)-[r]->(node2{newLabl: newLabel, nw: newValue})\" \\\n", " --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv 
\\\n", " --graph-cache ~/sqlite3_caches/temptrunc.sqlite3.db" ] }, { "cell_type": "markdown", "id": "tropical-cooperation", "metadata": {}, "source": [ "Next, we compute the statistics and comparisons on this truncated dataset. Note: the original string-literal subset of removed statements was around 9 GB. The join with claims increased it to 90 GB, and truncation has now reduced it to 778 MB." ] }, { "cell_type": "code", "execution_count": 8, "id": "meaning-closure", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14349490 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "crude-denmark", "metadata": {}, "outputs": [], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured.tsv" ] }, { "cell_type": "code", "execution_count": 12, "id": "white-valuation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n", "P1003\tP1630\thttp://alephnew.bibnat.ro:8991/F?func=find-b&request=$1&find_code=SYS&adjacent=Y&local_base=NLR10\tP1630\t\"http://aleph.bibnat.ro:8991/F/?func=direct&local_base=NLR10&doc_number=$1\"\t1\r\n", "P1004\tP1921\thttp://musicbrainz.org/$1/place\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n", "P1004\tP1921\thttps://musicbrainz.org/place/$1\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n", "P1005\tP1630\thttp://purl.pt/index/geral/aut/PT/$1.html\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n", "P1005\tP1630\thttp://urn.bn.pt/nca/unimarc-authorities/txt?id=$1\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n", 
"P1006\tP1630\thttp://data.bibliotheken.nl/id/thes/p$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n", "P1006\tP1630\thttp://opc4.kb.nl/DB=1/XMLPRS=Y/PPN?PPN=$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n", "P1006\tP1630\thttp://opc4.kb.nl/PPN?PPN=$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n", "P1006\tP1630\thttps://data.bibliotheken.nl/doc/thes/p$1\tP1630\t\"https://opc-kb.oclc.org/PPN?PPN=$1\"\t3\r\n" ] } ], "source": [ "!head ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv" ] }, { "cell_type": "code", "execution_count": 12, "id": "successful-singer", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "17865317d0014ed9bed573ef559e6d8c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import re\n", "\n", "from nltk.tokenize import word_tokenize as wt\n", "from rltk.similarity import levenshtein_distance as ld\n", "from tqdm.notebook import tqdm\n", "\n", "in_path = \"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_truncated.tsv\"\n", "out_path = \"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\"\n", "\n", "# Both patterns are matched against the whole removed value (node2)\n", "version_re = re.compile(r\"[\\d\\.]+[\\w\\s\\d]*\")\n", "# Note: '|' inside a character class is a literal, so it is accepted as a separator too\n", "range_re = re.compile(r\"[\\d]+[-|–][\\d]+\")\n", "\n", "# Context managers close both files, including the input file\n", "with open(in_path, \"r\") as f1, open(out_path, \"w\") as fStr:\n", "    firstLine = next(f1).rstrip()\n", "    fStr.write(firstLine + \"\\tVersionBool\\tRangeBool\\tLevDist\\tRearranged\\tRearrangedFirstNP\\n\")\n", "    for line in tqdm(f1):\n", "        line = line.rstrip()\n", "        cols = line.split(\"\\t\")\n", "        val1 = cols[2]        # removed value\n", "        val2 = cols[4][1:-1]  # new value without its surrounding quotes\n", "        versionBool = bool(version_re.fullmatch(val1))\n", "        rangeBool = bool(range_re.fullmatch(val1))\n", "        LevDist = ld(val1, val2)\n", "        # Same token set: the new value is a rearrangement of the removed one\n", "        rearranged = set(wt(val1)) == set(wt(val2))\n", "        # Same check after dropping the first character of the new value\n", "        rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n", "        fStr.write(\"\\t\".join([line, str(versionBool), str(rangeBool), str(LevDist),\n", "                              str(rearranged), str(rearrangedFirstNP)]) + \"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "id": "international-violation", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys label node1 \\\n", " --filter-keys label node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched2.tsv" ] }, { "cell_type": "code", "execution_count": 13, "id": "tracked-carroll", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1927007651 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals.tsv" ] }, { "cell_type": "code", "execution_count": 14, "id": "vocational-pound", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14349490 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv" ] }, { "cell_type": "code", "execution_count": 4, "id": "trained-tuning", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_unmatched.tsv" ] }, { "cell_type": "code", "execution_count": 15, "id": "economic-friday", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\tVersionBool\tRangeBool\tLevDist\tRearranged\tRearrangedFirstNP\r\n", "P1003\tP1630\thttp://alephnew.bibnat.ro:8991/F?func=find-b&request=$1&find_code=SYS&adjacent=Y&local_base=NLR10\tP1630\t\"http://aleph.bibnat.ro:8991/F/?func=direct&local_base=NLR10&doc_number=$1\"\t1\r\n", "\tFalse\tFalse\t51\tFalse\tFalse\r\n", "P1004\tP1921\thttp://musicbrainz.org/$1/place\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n", "\tFalse\tFalse\t6\tFalse\tFalse\r\n", "P1004\tP1921\thttps://musicbrainz.org/place/$1\tP1921\t\"http://musicbrainz.org/place/$1\"\t1\r\n", "\tFalse\tFalse\t1\tFalse\tFalse\r\n", "P1005\tP1630\thttp://purl.pt/index/geral/aut/PT/$1.html\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n", "\tFalse\tFalse\t31\tFalse\tFalse\r\n", "P1005\tP1630\thttp://urn.bn.pt/nca/unimarc-authorities/txt?id=$1\tP1630\t\"http://urn.bn.pt/nca/unimarc-authorities/html?id=$1\"\t3\r\n" ] } ], "source": [ "!head ../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured.tsv" ] }, { "cell_type": "code", "execution_count": 1, "id": "daily-complexity", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "str_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_str_new_vals_measured2.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 2, "id": "otherwise-bones", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 \\\n", "0 P1003 P1630 http://alephnew.bibnat.ro:8991/F?func=find-b&r... \n", "1 P1004 P1921 http://musicbrainz.org/$1/place \n", "2 P1004 P1921 https://musicbrainz.org/place/$1 \n", "3 P1005 P1630 http://purl.pt/index/geral/aut/PT/$1.html \n", "4 P1005 P1630 http://urn.bn.pt/nca/unimarc-authorities/txt?i... \n", "\n", " node2;newLabel node2;newValue \\\n", "0 P1630 http://aleph.bibnat.ro:8991/F/?func=direct&loc... \n", "1 P1921 http://musicbrainz.org/place/$1 \n", "2 P1921 http://musicbrainz.org/place/$1 \n", "3 P1630 http://urn.bn.pt/nca/unimarc-authorities/html?... \n", "4 P1630 http://urn.bn.pt/nca/unimarc-authorities/html?... \n", "\n", " node2;branching VersionBool RangeBool LevDist Rearranged \\\n", "0 1 False False 51 False \n", "1 1 False False 6 False \n", "2 1 False False 1 False \n", "3 3 False False 31 False \n", "4 3 False False 3 False \n", "\n", " RearrangedFirstNP \n", "0 False \n", "1 False \n", "2 False \n", "3 False \n", "4 False " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df.head()" ] }, { "cell_type": "code", "execution_count": 32, "id": "mounted-saint", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "62146" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(str_df[str_df['LevDist'] == 0])" ] }, { "cell_type": "code", "execution_count": 5, "id": "senior-custom", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "bool(re.fullmatch(\"[\\d\\.]+[\\w\\s\\d]*\",\"http://purl.pt/index/geral/aut/PT/$1.html\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "restricted-locking", "metadata": {}, "outputs": [], "source": [ "str_df['node2;branching'].mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "hundred-entrepreneur", "metadata": {}, "outputs": [], 
"source": [ "str_df['node2;branching'].value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": 3, "id": "secret-contest", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14349489" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df['node2;branching'].value_counts().sum()" ] }, { "cell_type": "code", "execution_count": 4, "id": "editorial-romance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14349489 updates, 254884 correspond to changes due to version change with average branching factor: 1.7222579683306916\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['VersionBool'].sum()} correspond to changes due to version change with average branching factor: {str_df[str_df['VersionBool'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 5, "id": "social-plenty", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 254884.000000\n", "mean 3.783427\n", "std 3.277387\n", "min 0.000000\n", "25% 2.000000\n", "50% 3.000000\n", "75% 5.000000\n", "max 209.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['VersionBool'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 6, "id": "promising-hopkins", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14349489 updates, 321953 correspond to changes due to range change with average branching factor: 1.0656493339089868\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['RangeBool'].sum()} correspond to changes due to range change with average branching factor: {str_df[str_df['RangeBool'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "varied-reform", "metadata": { "scrolled": true }, "outputs": [ { "data": { 
"text/plain": [ "count 321953.000000\n", "mean 2.343702\n", "std 2.188649\n", "min 0.000000\n", "25% 1.000000\n", "50% 2.000000\n", "75% 3.000000\n", "max 47.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['RangeBool'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 8, "id": "annoying-transaction", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14349489 updates, 234286 correspond to changes due to rearrangement with average branching factor: 3.4882536728613744\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['Rearranged'].sum()} correspond to changes due to rearrangement with average branching factor: {str_df[str_df['Rearranged'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 9, "id": "three-characteristic", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 234286.000000\n", "mean 2.873257\n", "std 2.006146\n", "min 0.000000\n", "25% 0.000000\n", "50% 4.000000\n", "75% 4.000000\n", "max 56.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['Rearranged'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 10, "id": "military-coordinator", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1.434949e+07\n", "mean 1.153558e+01\n", "std 5.467439e+00\n", "min 0.000000e+00\n", "25% 9.000000e+00\n", "50% 1.200000e+01\n", "75% 1.400000e+01\n", "max 1.445000e+03\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df.LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 11, "id": "european-treat", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'count v/s Lev edit distances')" ] 
}, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAaJ0lEQVR4nO3df5hdVX3v8feHhEDJYPgRO8YkkNCmai7RSqb8KLTOVNTwQ/LcW9qb3BShQtOnNj7eKtUg3IhYW9CLFgGLuZZyhZgRKUKKkdgiU+69CIVUIQQaHCGYREiQQHCAFlO/94+9xuycnplz5mTPnMny83qe8+Tsvdas/T1rzvmcPev8iCICMzPb/x3Q7gLMzKwaDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFAN9tHki6VdFO6fpSkAUkT9mG8zZJOTdc/KumLVdVqeXOgW8vKwbMPYyyW9OWxPu5oiYgfRERHRPw7gKQ+SRfsw3h/HhENf35fj2N5cKBbu50BrG13EWY5cKBnQtJMSbdKelbSc5KuSfsPkHSJpKck7ZD0JUlTUlu3pK0145T/3L9U0s3pZ34saaOkrtR2I3AU8HdpieHDdWp6TNKZpe2Jqb7jBmsD3gHcKelgSTel2l+Q9ICkzhHOwQGSlkv6fhrnZklHpLZvSFpW0/8hSf9liLFOlHRvquUhSd2lttmS/jHNyd8DU0ttsyRFuq2fBH4DuCbN0TVDHOuc9Pt5TtLFNW3l5Zy6czTUcSRdJWmLpBclrZf0GzXj1v3dpva696fU9t70u31e0jpJR6f9kvTZdD97UdIGSccO+0uzakVE2y7A9cAO4JEm+n4W+G66PA680M7ax9MFmAA8lOZoMnAwcEpqey/QDxwDdAC3Ajemtm5ga81Ym4FT0/VLgX8FTk/H+Avgvnp9h6hrBbCqtH0G8Fhp+0Tg2+n6HwJ/BxySjjUfeM0Q49Y9LvAB4D5gBnAQ8AVgdWp7D/D/Sn3nAi8AB9UZZzrwXLrdg086zwGvTe3fBj6TjvGbwI+Bm1LbLCCAiWm7D7hgmDmaCwykcQ5K4+6u+R3c1GiO6h0H+D3gSGAi8CHgGeDgRr9bhr8/LaS4P70pjXsJcG9qexewHjgMUOozrd2Pj5+nS3sPXtyJj6OJQK/5ufcD17d78sbLBTgJeHYwRGra7gLeV9p+A/CT9GDspnGg/0OpbS7wSr2+Q9T1yynsDknbq4AVpfZPAP8jXX8vcC/w5iZub93jAo8Bby9tTyvd1kOBl4CjU9snh7oPAR8hPemV9q0DzqX4q2Q3MLnU9mVaD/QVQG9pezLwKvUDfcg5anSc1Od54C2NfrcN7k/fAM4vbR8AvAwcDfwWxcnWicAB7X5c/Dxe2rrkEhH3ADvL+yT9kqQ705+I/0fSG+v86GJg9ZgUuX+YCTwVEbvrtL0eeKq0/RRFwDW7nPFM6frLwMGSJjbzgxHRTxGy75Z0CHAWRfgNOp096+c3UoRmr6QfSvqUpAObrHHQ0cDX0nLEC+nY/w50RsSPga8Di1LfxRRPMEON8zuD46SxTqF4gng98HxEvFTq/1SdMZr1emDL4EYa97kh+o5ojiRdmJZGdqXbMIXS8hBD/26Huz8dDVxVmpedFGfj0yPiW8A1wLXADkkrJb1muBtv1RqPa+grgfdHxHzgQuDz5ca0Xjcb+FYbahuvtgBHDRG0P6R4EA4aPMPcTnHGeshgg4q32r12BMdt5ruXV1OE50Lg0RTySHodRUD+M0BE/CQiPh4Rc4FfB86kWCYZiS3AaRFxWOlycERsK9ci6SSKZYS7hxnnxppxJkfE5cDTwOGSJpf6HzVMTY3m6GmKAAUgPfEdWXeg4edor+Ok9fIPA78LHB4RhwG7KMK3k
eHuT1uAP6yZm1+IiHtTjZ9Lj925wK8Af9rE8awi4yrQJXVQ3FG/Kum7FGug02q6LQJuifS2MAPgnyiC4XJJk9OLZyenttXAn6QX8jqAPwe+ks6+Hqc4KzsjneldQrGO26ztFGvzw+kF3gn8EXufnZ8G3BlR/N0uqUfSvPSk8iLFUslPhxn3wHQ7By8TgeuAT5ZepHutpIWln1lL8eR2GcUcDDX+TRR/VbxL0oQ0frekGRHxFPAg8HFJkySdArx7mDobzdEtwJmSTpE0KdVW93HZYI5qj3MoxRP3s8BESSuAZs+Wh7s/XQdcJOk/pZqmSPqddP3XJJ2Q7ksvUazRD/c7tIqNq0CnqOeFiPjV0uVNNX0W4eWWvaQnt3dTrFn/ANgK/NfUfD3Fn+r3AE9SPMjen35uF/A+4IvANooH4V7vemngL4BL0p/fFw5R29MULyL+OvCVUlPt2xVfRxFuL1Islfxjqnsoa4FXSpdLgauANcA3Jf2Y4gXSE0q1/BvFi8KnsveTS23NWyj+ovgoRSBuoTjTHHy8/Lc07k7gY8CXhqnzKuDs9I6Qz9U51kbgj1M9T1Oscw/1OxhujmqPsw64k+JJ+ymK3/uW/zBiHcPdnyLia8AVFMs+LwKPUDw5Q/GE8b/SbXiKYuno080c06qhdILUvgKkWcAdEXFs2r4X+GxEfFWSKF4Aeii1vZHiTjo72l24tSydTT8DHBMRL7a7HrNctPUMXdJqirO3N0jaKul8YAlwvqSHgI0UZ0qDFlG8I8Bhvn87guLdLQ5zswq1/QzdzMyqMd7W0M3MrEVNvZ94NEydOjVmzZrV0s++9NJLTJ48uXHHcWB/qdV1Vst1Vst17rF+/fofRUT9txe36xNN8+fPj1bdfffdLf/sWNtfanWd1XKd1XKdewAPxnj8pKiZmVXHgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWWibR/93xcbtu3ivOVfr9u2+fIzxrgaM7PxoeEZuqTrJe2Q9EiDfr8mabeks6srz8zMmtXMkssNwILhOqT/EusK4JsV1GRmZi1oGOgRcQ/Ff7U1nPcDfwvsqKIoMzMbuab+g4va/yaupm06xf+H2EPx/1feERG3DDHOUmApQGdn5/ze3t6Wit6xcxfbX6nfNm/6lJbGHC0DAwN0dHS0u4yGXGe1XGe1XOcePT096yOiq15bFS+K/iXwkYj4afFfgA4tIlYCKwG6urqiu7u7pQNevep2rtxQv/TNS1obc7T09fXR6u0cS66zWq6zWq6zOVUEehfF/wAOMBU4XdLuiLitgrHNzKxJ+xzoETF78LqkGyiWXG7b13HNzGxkGga6pNVANzBV0lbgY8CBABFx3ahWZ2ZmTWsY6BGxuNnBIuK8farGzMxa5o/+m5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmWgY6JKul7RD0iNDtC+R9LCkDZLulfSW6ss0M7NGmjlDvwFYMEz7k8DbImIe8AlgZQV1mZnZCE1s1CEi7pE0a5j2e0ub9wEzKqjLzMxGSBHRuFMR6HdExLEN+l0IvDEiLhiifSmwFKCzs3N+b2/viAsG2LFzF9tfqd82b/qUlsYcLQMDA3R0dLS7jIZcZ7VcZ7Vc5x49PT3rI6KrXlvDM/RmSeoBzgdOGapPRKwkLcl0dXVFd3d3S8e6etXtXLmhfumbl7Q25mjp6+uj1ds5llxntVxntVxncyoJdElvBr4InBYRz1UxppmZjcw+v21R0lHArcA5EfH4vpdkZmataHiGLmk10A1MlbQV+BhwIEBEXAesAI4EPi8JYPdQ6ztmZjZ6mnmXy+IG7RcAdV8ENTOzseNPipqZZcKBbmaWCQe6mVkmHOhmZplwoJuZZ
cKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZploGOiSrpe0Q9IjQ7RL0uck9Ut6WNJx1ZdpZmaNNHOGfgOwYJj204A56bIU+Kt9L8vMzEaqYaBHxD3AzmG6LAS+FIX7gMMkTauqQDMza44ionEnaRZwR0QcW6ftDuDyiPi/afsu4CMR8WCdvkspzuLp7Oyc39vb21LRO3buYvsr9dvmTZ/S0pijZWBggI6OjnaX0ZDrrJbrrJbr3KOnp2d9RHTVa5s4qkeuERErgZUAXV1d0d3d3dI4V6+6nSs31C9985LWxhwtfX19tHo7x5LrrJbrrJbrbE4V73LZBswsbc9I+8zMbAxVEehrgPekd7ucCOyKiKcrGNfMzEag4ZKLpNVANzBV0lbgY8CBABFxHbAWOB3oB14Gfn+0ijUzs6E1DPSIWNygPYA/rqwiMzNriT8pamaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2SltdpP0rS3ZK+I+lhSadXX6qZmQ2nYaBLmgBcC5wGzAUWS5pb0+0S4OaIeCuwCPh81YWamdnwmjlDPx7oj4gnIuJVoBdYWNMngNek61OAH1ZXopmZNUMRMXwH6WxgQURckLbPAU6IiGWlPtOAbwKHA5OBUyNifZ2xlgJLATo7O+f39va2VPSOnbvY/kr9tnnTp7Q05mgZGBigo6Oj3WU05Dqr5Tqr5Tr36OnpWR8RXfXaJlZ0jMXADRFxpaSTgBslHRsRPy13ioiVwEqArq6u6O7ubulgV6+6nSs31C9985LWxhwtfX19tHo7x5LrrJbrrJbrbE4zSy7bgJml7RlpX9n5wM0AEfFt4GBgahUFmplZc5oJ9AeAOZJmS5pE8aLnmpo+PwDeDiDpTRSB/myVhZqZ2fAaBnpE7AaWAeuAxyjezbJR0mWSzkrdPgT8gaSHgNXAedFocd7MzCrV1Bp6RKwF1tbsW1G6/ihwcrWlmZnZSPiTomZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSaaCnRJCyRtktQvafkQfX5X0qOSNkr6crVlmplZIxMbdZA0AbgWeAewFXhA0pqIeLTUZw5wEXByRDwv6RdHq2AzM6uvmTP044H+iHgiIl4FeoGFNX3+ALg2Ip4HiIgd1ZZpZmaNKCKG7yCdDSyIiAvS9jnACRGxrNTnNuBx4GRgAnBpRNxZZ6ylwFKAzs7O+b29vS0VvWPnLra/Ur9t3vQpLY05WgYGBujo6Gh3GQ25zmq5zmq5zj16enrWR0RXvbaGSy5NmgjMAbqBGcA9kuZFxAvlThGxElgJ0NXVFd3d3S0d7OpVt3Plhvqlb17S2pijpa+vj1Zv51hyndVyndVync1pZsllGzCztD0j7SvbCqyJiJ9ExJMUZ+tzqinRzMya0UygPwDMkTRb0iRgEbCmps9tFGfnSJoK/ArwRHVlmplZIw0DPSJ2A8uAdcBjwM0RsVHSZZLOSt3WAc9JehS4G/jTiHhutIo2M7P/qKk19IhYC6yt2beidD2AD6aLmZm1gT8pamaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2Slg/T77clhaSu6ko0M7NmNAx0SROAa
4HTgLnAYklz6/Q7FPgAcH/VRZqZWWPNnKEfD/RHxBMR8SrQCyys0+8TwBXAv1ZYn5mZNUkRMXwH6WxgQURckLbPAU6IiGWlPscBF0fEb0vqAy6MiAfrjLUUWArQ2dk5v7e3t6Wid+zcxfZX6rfNmz6lpTFHy8DAAB0dHe0uoyHXWS3XWS3XuUdPT8/6iKi7rD1xXweXdADwGeC8Rn0jYiWwEqCrqyu6u7tbOubVq27nyg31S9+8pLUxR0tfXx+t3s6x5Dqr5Tqr5Tqb08ySyzZgZml7Rto36FDgWKBP0mbgRGCNXxg1MxtbzQT6A8AcSbMlTQIWAWsGGyNiV0RMjYhZETELuA84q96Si5mZjZ6GgR4Ru4FlwDrgMeDmiNgo6TJJZ412gWZm1pym1tAjYi2wtmbfiiH6du97WWZmNlL+pKiZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpaJpgJd0gJJmyT1S1pep/2Dkh6V9LCkuyQdXX2pZmY2nIaBLmkCcC1wGjAXWCxpbk237wBdEfFm4BbgU1UXamZmw2vmDP14oD8inoiIV4FeYGG5Q0TcHREvp837gBnVlmlmZo0oIobvIJ0NLIiIC9L2OcAJEbFsiP7XAM9ExJ/VaVsKLAXo7Oyc39vb21LRO3buYvsr9dvmTZ/S0pijZWBggI6OjnaX0ZDrrJbrrJbr3KOnp2d9RHTVa5tY5YEk/R7QBbytXntErARWAnR1dUV3d3dLx7l61e1cuaF+6ZuXtDbmaOnr66PV2zmWXGe1XGe1XGdzmgn0bcDM0vaMtG8vkk4FLgbeFhH/Vk15ZmbWrGbW0B8A5kiaLWkSsAhYU+4g6a3AF4CzImJH9WWamVkjDQM9InYDy4B1wGPAzRGxUdJlks5K3T4NdABflfRdSWuGGM7MzEZJU2voEbEWWFuzb0Xp+qkV12VmZiPkT4qamWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZaCrQJS2QtElSv6TlddoPkvSV1H6/pFmVV2pmZsNqGOiSJgDXAqcBc4HFkubWdDsfeD4ifhn4LHBF1YWamdnwJjbR53igPyKeAJDUCywEHi31WQhcmq7fAlwjSRERFdbalFnLv153/+bLzxjjSszMxlYzgT4d2FLa3gqcMFSfiNgtaRdwJPCjcidJS4GlaXNA0qZWigam1o7diNr3N8OIa20T11kt11kt17nH0UM1NBPolYmIlcDKfR1H0oMR0VVBSaNuf6nVdVbLdVbLdTanmRdFtwEzS9sz0r66fSRNBKYAz1VRoJmZNaeZQH8AmCNptqRJwCJgTU2fNcC56frZwLfasX5uZvbzrOGSS1oTXwasAyYA10fERkmXAQ9GxBrgr4EbJfUDOylCfzTt87LNGNpfanWd1XKd1XKdTZBPpM3M8uBPipqZZcKBbmaWif0u0Bt9DcEY1zJT0t2SHpW0UdIH0v4jJP29pO+lfw9P+yXpc6n2hyUdN8b1TpD0HUl3pO3Z6asa+tNXN0xK+9v2VQ6SDpN0i6R/kfSYpJPG43xK+pP0O39E0mpJB4+X+ZR0vaQdkh4p7RvxHEo6N/X/nqRz6x1rFOr8dPrdPyzpa5IOK7VdlOrcJOldpf2jmgn16iy1fUhSSJqatts2nwBExH5zoXhR9vvAMcAk4CFgbhvrmQYcl64fCjxO8fUInwKWp/3LgSvS9dOBbwACTgTuH+N6Pwh8Gbgjbd8MLErXrwP+KF1/H3Bdur4I+MoY1vi/gQvS9UnAYeNtPik+SPck8AuleTxvvMwn8JvAccAjp
X0jmkPgCOCJ9O/h6frhY1DnO4GJ6foVpTrnpsf7QcDslAMTxiIT6tWZ9s+keLPIU8DUds9nROx3gX4SsK60fRFwUbvrKtVzO/AOYBMwLe2bBmxK178ALC71/1m/MahtBnAX8FvAHekO96PSg+dnc5vupCel6xNTP41BjVNSUKpm/7iaT/Z8MvqIND93AO8aT/MJzKoJyhHNIbAY+EJp/179RqvOmrb/DKxK1/d6rA/O6VhlQr06Kb7m5C3AZvYEelvnc39bcqn3NQTT21TLXtKf0W8F7gc6I+Lp1PQM0Jmut7P+vwQ+DPw0bR8JvBARu+vUstdXOQCDX+Uw2mYDzwJ/k5aGvihpMuNsPiNiG/A/gR8AT1PMz3rG33yWjXQOx8Nj7b0UZ7sMU09b6pS0ENgWEQ/VNLW1zv0t0MclSR3A3wL/PSJeLLdF8XTc1veGSjoT2BER69tZRxMmUvxp+1cR8VbgJYrlgZ8ZJ/N5OMUX0s0GXg9MBha0s6aRGA9z2Iiki4HdwKp211JL0iHAR4EV7a6l1v4W6M18DcGYknQgRZiviohb0+7tkqal9mnAjrS/XfWfDJwlaTPQS7HschVwmIqvaqitpV1f5bAV2BoR96ftWygCfrzN56nAkxHxbET8BLiVYo7H23yWjXQO2/ZYk3QecCawJD35MEw97ajzlyiezB9Kj6kZwD9Lel2769zfAr2ZryEYM5JE8SnZxyLiM6Wm8lchnEuxtj64/z3plfATgV2lP4NHTURcFBEzImIWxZx9KyKWAHdTfFVDvTrH/KscIuIZYIukN6Rdb6f4muZxNZ8USy0nSjok3QcG6xxX81ljpHO4DninpMPTXyTvTPtGlaQFFEuDZ0XEyzX1L0rvGJoNzAH+iTZkQkRsiIhfjIhZ6TG1leLNEc/Q7vmselF+tC8UryI/TvHK9sVtruUUij9dHwa+my6nU6yP3gV8D/gH4IjUXxT/Wcj3gQ1AVxtq7mbPu1yOoXhQ9ANfBQ5K+w9O2/2p/ZgxrO9XgQfTnN5G8Y6AcTefwMeBfwEeAW6kePfFuJhPYDXF2v5PKMLm/FbmkGINuz9dfn+M6uynWGsefDxdV+p/capzE3Baaf+oZkK9OmvaN7PnRdG2zWdE+KP/Zma52N+WXMzMbAgOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy8f8BVU8FU6OhyzoAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "str_df.LevDist.hist(bins=50).set_title(\"count v/s Lev edit distances\")" ] }, { "cell_type": "code", "execution_count": 12, "id": "dangerous-civilian", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Text(0.5, 0, 'Levenshtein Distance'), Text(0, 0.5, 'Count (in millions)')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmEAAAF+CAYAAADKnc2YAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAgNklEQVR4nO3de7RdZX3u8e8jAWnVSG2iwxPAIMW2DC+oMVq1Fqn2xNaC1kvkqK0tyuk4xUPr5TQ9dkQbjz2xttZovUVFtFYBLWpULLaKpUOFEBQUsCqCHLOPRxJvuzcv4O/8sWZ0ZbMvKyFzv3uv9f2Mscde853vWvM3mbh5nO+75puqQpIkSYvrDq0LkCRJmkSGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWpgWYawJOckuTnJNSP2f2qS65Jcm+SdfdcnSZK0kCzH54QleRTwr8Dbq+q+C/Q9AbgAOKWqvpXk7lV182LUKUmSNJdleSesqi4FvjncluT4JH+X5Mok/5Tk57pdzwFeW1Xf6t5rAJMkSc0tyxA2h+3Ac6vqwcALgNd17fcB7pPkE0kuS7KhWYWSJEmdFa0LOBSS3Bl4OPDuJPua79j9XgGcAJwMHA1cmuR+VfXtRS5TkiTpR8YihDG4o/ftqjppln27gcur6gfAjUm+yCCUXbGI9UmSJO1nLIYjq2qaQcB6CkAGHtDtfh+Du2AkWcVgePKGBmVKkiT9yLIMYUneBXwK+Nkku5OcATwdOCPJ1cC1wGld94uBbyS5DrgEeGFVfaNF3ZIkSfssy0dUSJIkLXfL8k6YJEnScmcIkyRJamDZfTty1apVtXbt2tZlSJIkLejKK6/cW1WrZ9u37ELY2rVr2bVrV+syJEmSFpTkprn2ORwpSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ2saF2ANInO3rSZqb3T+7WtWbWSbVu3NKpIkrTYDGFSA1N7p1mxfuP+bTvPb1SNJKkFQ5i0hHnHTJLGlyFMWsK8YyZJ48uJ+ZIkSQ0YwiRJkhowhEmSJDVgCJMkSWqgtxCW5JwkNye5ZoF+D0lyS5In91WLJEnSUtPnnbBzgQ3zdUhyGPBy4CM91iFJkrTk9BbCqupS4JsLdHsu8LfAzX3VIUmStBQ1mxOWZA3wROD1rWqQJElqpeXE/FcBf1hVP1yoY5Izk+xKsmvPnj39VyZJktSzlk/MXweclwRgFfCrSW6pqvfN7FhV24HtAOvWravFLFKSJKkPzUJYVR2373WSc4EPzhbAJEmSxlFvISzJu4CTgVVJdgMvBg4HqKo39HVcSZKk5aC3EFZVpx9A32f1VYckSdJS5BPzJUmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZ
JktSAIUySJKmBFa0LkJarszdtZmrv9H5ta1atZNvWLY0qkiQtJ4Yw6SBN7Z1mxfqN+7ftPL9RNZKk5cbhSEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ14ALe0pg6e9NmpvZO79e2ZtVKtm3d0qgiSdIwQ5g0pqb2TrNi/cb923ae36gaSdJMDkdKkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA72FsCTnJLk5yTVz7H96ks8m+VySTyZ5QF+1SJIkLTV93gk7F9gwz/4bgV+qqvsBLwW291iLJEnSktLbc8Kq6tIka+fZ/8mhzcuAo/uqRZIkaalZKnPCzgA+3LoISZKkxdL8iflJHs0ghD1ynj5nAmcCHHvssYtUmSRJUn+a3glLcn/gzcBpVfWNufpV1faqWldV61avXr14BUqSJPWkWQhLcixwIfDMqvpiqzokSZJa6G04Msm7gJOBVUl2Ay8GDgeoqjcAm4GfBl6XBOCWqlrXVz2SJElLSZ/fjjx9gf3PBp7d1/Glg3H2ps1M7Z3er23NqpVs27qlUUWSpHHVfGK+tJRM7Z1mxfqN+7ftPL9RNZKkcbZUHlEhSZI0UQxhkiRJDRjCJEmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaWNG6AOlQO3vTZqb2Tu/XtmbVSrZt3dKoIkmSbssQprEztXeaFes37t+28/xG1UiSNDuHIyVJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqwBAmSZLUQG8hLMk5SW5Ocs0c+5Pk1UmuT/LZJA/qqxZJkqSlps87YecCG+bZ/zjghO7nTOD1PdYiSZK0pPQWwqrqUuCb83Q5DXh7DVwGHJXknn3VI0mStJS0nBO2Bvjq0Pburu02kpyZZFeSXXv27FmU4iRJkvq0LCbmV9X2qlpXVetWr17duhxJkqTbrWUImwKOGdo+umuTJEkaey1D2A7gN7tvST4M+E5Vfa1hPZIkSYtmRV8fnORdwMnAqiS7gRcDhwNU1RuAi4BfBa4H/h347b5qkSRJWmp6C2FVdfoC+wv4vb6OL0mStJQti4n5kiRJ42bBO2FJjgQeD/wi8J+A/wCuAT5UVdf2W54kSdJ4mjeEJfkTBgHs48DlwM3AkcB9gK1dQHt+VX225zolSZLGykJ3wnZW1Yvn2PfKJHcHjj3ENUlaRGdv2szU3un92tasWsm2rVsaVSRJk2HeEFZVH5rZluQOwJ2rarqqbmZwd0zSMjW1d5oV6zfu37bz/EbVSNLkGGlifpJ3JlmZ5E4M5oNdl+SF/ZYmSZI0vkb9duSJVTUNPAH4MHAc8My+ipIkSRp3oz4n7PAkhzMIYX9VVT9IUv2VJd2Wc5ckSeNk1BD2RuArwNXApUnuBUzP+w7pEHPukiRpnIwUwqrq1cCrh5puSvLofkqSJEkafyOFsCR3BJ4ErJ3xHseBJEmSDsKow5HvB74DXAl8r79yJEmSJsOoIezoqtrQayWSJEkTZNRHVHwyyf16rUSSJGmCjHon7JHAs5LcyGA4MkBV1f17q0ySJGmMjRrCHtdrFZIkSRNmpOHIqroJOAr49e7nqK5NkiRJB2HUtSPPBv4GuHv3844kz+2zMEmSpHE26nDkGcBDq+rfAJK8HPgU8Jq+CpMkSRpno347MsCtQ9u3dm2SJEk6CKPeCXsrcHmS93bbTwDe0ktFkiRJE2DUtSNfmeTjDB5VAfDbVfWZ3qqSJEkac/OGsCQrq2o6yd2Ar3Q/+/bdraq+2W95kiRJ42mhO2HvBB7PYM3
IGmpPt33vnuqSJEkaa/OGsKp6fPf7uMUpR5IkaTIsNBz5oPn2V9WnD205kiRJk2Gh4ci/mGdfAaccwlokSZImxkLDkY9erEIkSZImyULDkb8x3/6quvDQliNJkjQZFhqO/PV59hVgCJMkSToICw1H/vZiFSJJkjRJFhqOfEZVvSPJ82bbX1Wv7KcsSZKk8bbQcOSdut936bsQSZKkSbLQcOQbu99/sjjlaNydvWkzU3un92tbs2ol27ZuaVSRJEltjLSAd5LjgOcCa4ffU1WnLvC+DcA24DDgzVW1dcb+Y4G3AUd1fTZV1UWjl6/lZmrvNCvWb9y/bef5jaqRJKmdkUIY8D7gLcAHgB+O8oYkhwGvBR4L7AauSLKjqq4b6vbHwAVV9fokJwIXMQh6kiRJY23UEPbdqnr1AX72euD6qroBIMl5wGnAcAgrYGX3+q7A/z3AY0iSJC1Lo4awbUleDHwE+N6+xgXWjlwDfHVoezfw0Bl9XgJ8JMlzGXwJ4DGzfVCSM4EzAY499tgRS5YkSVq6Rg1h9wOeyWCtyH3DkYdi7cjTgXOr6i+S/ALw10nuW1X7DXlW1XZgO8C6devqdh5TkiSpuVFD2FOAe1fV9w/gs6eAY4a2j+7ahp0BbACoqk8lORJYBdx8AMeRJEladu4wYr9rGHyD8UBcAZyQ5LgkRwBPA3bM6PN/gF8GSPLzwJHAngM8jiRJ0rIz6p2wo4B/TnIF+88Jm/MRFVV1S5KzgIsZPH7inKq6NskWYFdV7QCeD7wpyR8wGN58VlU53ChJksbeqCHsxQfz4d0zvy6a0bZ56PV1wCMO5rMlSZKWs5FCWFX9Y9+FSJIkTZJR54RJkiTpEDKESZIkNWAIkyRJamDUBbwfweDp9vfq3hOgqure/ZUmSZI0vkb9duRbgD8ArgRu7a8cSZKkyTBqCPtOVX2410okSZImyKgh7JIkrwAuZPQFvCVJkjSHUUPYQ7vf64baDsUC3pIkSRNp1Ie1PrrvQiRJkibJvCEsyTOq6h1Jnjfb/qp6ZT9lSZIkjbeF7oTdqft9l74LkSRJmiTzhrCqemP3+08WpxxJkqTJMO8T85P8cZK7zbP/lCSPP/RlSZIkjbeFhiM/B3wgyXeBTwN7gCOBE4CTgH8A/rTPAiVJksbRQsOR7wfen+QE4BHAPYFp4B3AmVX1H/2XKEmSNH5GfUTFl4Av9VyLJEnSxJh3TpgkSZL6YQiTJElqYKThyCSPqKpPLNQmabKcvWkzU3unb9O+ZtVKtm3d0qAiSVo+Rl078jXAg0ZokzRBpvZOs2L9xtu27zy/QTWStLwstGzRLwAPB1bPWLpoJXBYn4VJkiSNs4XuhB0B3LnrN7x00TTw5L6KkiRJGncLPSfsH4F/THJuVd20SDVJkiSNvVHnhN0xyXZg7fB7quqUPoqSJEkad6OGsHcDbwDeDNzaXzmSJEmTYdQQdktVvb7XSiRJkibIqA9r/UCS/5bknknutu+n18okSZLG2Kh3wn6r+/3CobYC7n1oy5EkSZoMoy7gfVzfhUiSJE2SUZct+s3Z2qvq7Ye2HEmSpMkw6nDkQ4ZeHwn8MvBpwBAmSZJ0EEYdjnzu8HaSo4Dz+ihIkiRpEoz67ciZ/g1YcJ5Ykg1JvpDk+iSb5ujz1CTXJbk2yTsPsh5JkqRlZdQ5YR9g8G1IGCzc/fPABQu85zDgtcBjgd3AFUl2VNV1Q31OAP4IeERVfSvJ3Q/8FCRJkpafUeeE/fnQ61uAm6pq9wLvWQ9cX1U3ACQ5DzgNuG6oz3OA11bVtwCq6uYR65EkSVrWRhqO7Bby/mfgLsBPAd8f4W1rgK8Obe/u2obdB7hPkk8kuSzJhlHqkSRJWu5GCmFJngrsBJ4CPBW4PMmTD8HxVwAnACcDpwNv6ib9zzz+mUl2Jdm1Z8+eQ3BYSZKktkYdjnwR8JB9w4VJVgP/ALxnnvdMAccMbR/dtQ3bDVxeVT8AbkzyRQah7IrhTlW
1HdgOsG7dukKSJGmZG/XbkXeYMV/rGyO89wrghCTHJTkCeBqwY0af9zG4C0aSVQyGJ28YsSZJkqRla9Q7YX+X5GLgXd32RuDD872hqm5JchZwMYNvVJ5TVdcm2QLsqqod3b5fSXIdcCvwwqr6xsGciCRJ0nIy6sNaX5jkN4BHdk3bq+q9I7zvIuCiGW2bh14X8LzuR5IkaWLMG8KS/Axwj6r6RFVdCFzYtT8yyfFV9eXFKFKSJGncLDSv61XA9Czt3+n2SZIk6SAsFMLuUVWfm9nYta3tpSJJkqQJsFAIO2qefT9xCOuQJEmaKAuFsF1JnjOzMcmzgSv7KUmSJGn8LfTtyN8H3pvk6fw4dK0DjgCe2GNdkiRJY23eEFZVXwcenuTRwH275g9V1cd6r0ySJGmMjfqcsEuAS3quRZIkaWKMumyRJEmSDiFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAIUySJKkBQ5gkSVIDhjBJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqYEXrAiRNhrM3bWZq7/R+bWtWrWTb1i2NKpKktgxhkhbF1N5pVqzfuH/bzvMbVSNJ7TkcKUmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhrwERU6JHwGlCRJB8YQpkPCZ0BJknRgHI6UJElqoNcQlmRDki8kuT7Jpnn6PSlJJVnXZz2SJElLRW8hLMlhwGuBxwEnAqcnOXGWfncBzgYu76sWSZKkpabPO2Hrgeur6oaq+j5wHnDaLP1eCrwc+G6PtUiSJC0pfYawNcBXh7Z3d20/kuRBwDFV9aH5PijJmUl2Jdm1Z8+eQ1+pJEnSIms2MT/JHYBXAs9fqG9Vba+qdVW1bvXq1f0XJ0mS1LM+Q9gUcMzQ9tFd2z53Ae4LfDzJV4CHATucnC9JkiZBnyHsCuCEJMclOQJ4GrBj386q+k5VraqqtVW1FrgMOLWqdvVYkyRJ0pLQWwirqluAs4CLgc8DF1TVtUm2JDm1r+NKkiQtB70+Mb+qLgIumtG2eY6+J/dZiyRJ0lLiE/MlSZIaMIRJkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGjCESZIkNWAIkyRJaqDXtSMl6VA4e9NmpvZO79e2ZtVKtm3d0qgiSbr9DGGSlrypvdOsWL9x/7ad5zeqRpIODYcjJUmSGvBO2O3kMIkkSToYhrDbyWESSZJ0MByOlCRJasAQJkmS1IDDkWPOOWuSJC1NhrAx55w1SZKWJocjJUmSGjCESZIkNWAIkyRJasAQJkmS1IAhTJIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEmSJDVgCJMkSWrAECZJktSAC3gLgLM3bWZq7/Rt2tesWsm2rVsaVCQduNn+PfbfYUlLlSGsoaX0H4ypvdOsWL/xtu07z1/0WqSDNdu/x/47LGmp6jWEJdkAbAMOA95cVVtn7H8e8GzgFmAP8DtVdVOfNS0l/gdDkqTJ1ducsCSHAa8FHgecCJye5MQZ3T4DrKuq+wPvAf6sr3okSZKWkj4n5q8Hrq+qG6rq+8B5wGnDHarqkqr6927zMuDoHuuRJElaMvoMYWuArw5t7+7a5nIG8OHZdiQ5M8muJLv27NlzCEuUJElqY0k8oiLJM4B1wCtm219V26tqXVWtW7169eIWJ0mS1IM+J+ZPAccMbR/dte0nyWOAFwG/VFXf67EeSZKkJaPPO2FXACckOS7JEcDTgB3DHZI8EHgjcGpV3dxjLZIkSUtKbyGsqm4BzgIuBj4PXFBV1ybZkuTUrtsrgDsD705yVZIdc3ycJEnSWOn1OWFVdRFw0Yy2zUOvH9Pn8SVJkpaqJTExX5IkadIYwiRJkhowhEmSJDVgCJMkSWqg14n5krRcnL1pM1N7p/drW7NqJdu2bmlUkaRxZwiTJGBq7zQ
r1m/cv23n+Y2qkTQJHI6UJElqwDthy4DDJJIkjR9D2DLgMIkkSePH4UhJkqQGDGGSJEkNGMIkSZIaMIRJkiQ1YAiTJElqwG9HStIB8JExkg4VQ5gkHQAfGSPpUHE4UpIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhrw25GStIh8xIWkfQxhkrSIfMSFpH0cjpQkSWrAO2FzcMhAkiT1yRA2B4cMJElSnwxhkrQEeTdeGn+GMElagrwbL40/J+ZLkiQ1YAiTJElqwOFISRozzieTlgdDmCSNGeeTScuDIUySJpR3zKS2eg1hSTYA24DDgDdX1dYZ++8IvB14MPANYGNVfaXPmiRJA7fnjpkBTrr9egthSQ4DXgs8FtgNXJFkR1VdN9TtDOBbVfUzSZ4GvBzYeNtPkyQtJaMGuNnCGhjYJOj3Tth64PqqugEgyXnAacBwCDsNeEn3+j3AXyVJVVWPdUmSFslsYQ284yZBvyFsDfDVoe3dwEPn6lNVtyT5DvDTwN4e65IkLRO3547bqGFt1PcuxjEW6zjL8RgH8v7lIn3ddEryZGBDVT27234m8NCqOmuozzVdn93d9pe7PntnfNaZwJnd5s8CX+il6P2tYnLDoOc+uSb5/Cf53GGyz99zn1yLcf73qqrVs+3o807YFHDM0PbRXdtsfXYnWQHclcEE/f1U1XZge091zirJrqpat5jHXCo898k8d5js85/kc4fJPn/PfTLPHdqff59PzL8COCHJcUmOAJ4G7JjRZwfwW93rJwMfcz6YJEmaBL3dCevmeJ0FXMzgERXnVNW1SbYAu6pqB/AW4K+TXA98k0FQkyRJGnu9Piesqi4CLprRtnno9XeBp/RZw+2wqMOfS4znPrkm+fwn+dxhss/fc59cTc+/t4n5kiRJmlufc8IkSZI0B0PYDEk2JPlCkuuTbGpdz2JL8pUkn0tyVZJdrevpU5JzktzcPSplX9vdkvx9ki91v3+qZY19muP8X5Jkqrv+VyX51ZY19iXJMUkuSXJdkmuTnN21j/31n+fcJ+XaH5lkZ5Kru/P/k679uCSXd3/7z+++UDZW5jn3c5PcOHTtT2pcam+SHJbkM0k+2G03ve6GsCFDSy09DjgROD3JiW2rauLRVXXSBHxt+Vxgw4y2TcBHq+oE4KPd9rg6l9ueP8Bfdtf/pG5e5zi6BXh+VZ0IPAz4ve5/65Nw/ec6d5iMa/894JSqegBwErAhycMYLJv3l1X1M8C3GCyrN27mOneAFw5d+6taFbgIzgY+P7Td9Lobwvb3o6WWqur7wL6lljSGqupSBt/KHXYa8Lbu9duAJyxmTYtpjvOfCFX1tar6dPf6Xxj8UV7DBFz/ec59ItTAv3abh3c/BZzCYPk8GN9rP9e5T4QkRwO/Bry52w6Nr7shbH+zLbU0MX+cOgV8JMmV3UoFk+YeVfW17vX/A+7RsphGzkry2W64cuyG42ZKshZ4IHA5E3b9Z5w7TMi174akrgJuBv4e+DLw7aq6pesytn/7Z557Ve279i/rrv1fJrljuwp79SrgfwA/7LZ/msbX3RCmmR5ZVQ9iMCT7e0ke1bqgVroHB0/M/0vsvB44nsFQxdeAv2haTc+S3Bn4W+D3q2q/herG/frPcu4Tc+2r6taqOonBSi7rgZ9rW9HimXnuSe4L/BGDfwYPAe4G/GG7CvuR5PHAzVV1ZetahhnC9jfKUktjraqmut83A+9l8Adqknw9yT0But83N65nUVXV17s/0j8E3sQYX/8khzMIIX9TVRd2zRNx/Wc790m69vtU1beBS4BfAI7KYPk8mIC//UPnvqEboq6q+h7wVsbz2j8CODXJVxhMNToF2Ebj624I298oSy2NrSR3SnKXfa+BXwGumf9dY2d4Ka3fAt7fsJZFty+AdJ7ImF7/bi7IW4DPV9Urh3aN/fWf69wn6NqvTnJU9/ongMcymBd3CYPl82B8r/1s5/7PQ//HIwzmRI3dta+qP6q
qo6tqLYP/tn+sqp5O4+vuw1pn6L6W/Sp+vNTSy9pWtHiS3JvB3S8YrKbwznE+/yTvAk4GVgFfB14MvA+4ADgWuAl4alWN5eT1Oc7/ZAbDUQV8BfivQ3OkxkaSRwL/BHyOH88P+Z8M5kaN9fWf59xPZzKu/f0ZTMA+jMGNiAuqakv39+88BsNxnwGe0d0ZGhvznPvHgNVAgKuA3x2awD92kpwMvKCqHt/6uhvCJEmSGnA4UpIkqQFDmCRJUgOGMEmSpAYMYZIkSQ0YwiRJkhowhEk6YEmWzNfX56olyROGFqae7/2/m+Q3D+B4a5P8R5LPJPl8kp1JnjW0/9Qkcy78neSk7lE4kibcioW7SNKy9ATgg8B183WqqjccxGd/uaoeCD96vt6FSVJVb62qHcz/kOeTgHXARQdxXEljxDthkg6JJMcn+btu8fd/SvJzSe6a5KYkd+j63CnJV5McPlv/rs+5SV6d5JNJbkjy5K79nkkuTXJVkmuS/OLQsV+W5OoklyW5R5KHA6cCr+j6Hz/P8V6S5AXd648neXl3d+uLw8eYS1XdADwP+O/dZzwryV91r5/S1Xp1V/sRwBZgY1fXxiTrk3yqu7P2ySQ/O/Q5F3Y1fynJnw2d74Ykn+4+96ND/2zP6Wr/TJLTbu81ldQv74RJOlS2M3jS9peSPBR4XVWdkuQq4JcYLA/yeODiqvpBktv0Z7CeG8A9gUcyWFR4B/Ae4L90731ZksOAn+z63gm4rKpe1AWV51TV/0qyA/hgVb0HoAsrcx1v2IqqWt8NGb4YeMwI5/5pZl8EejPwn6tqKslRVfX9JJuBdVV1VlfXSuAXq+qWJI8B/hR4Uvf+k4AHAt8DvpDkNcB3Gazt+KiqujHJ3bq+L2KwFMvvZLA0zc4k/1BV/zZC/ZIaMIRJut2S3Bl4OPDuwfJzANyx+30+sJFBCHsa8LoF+gO8r1tI+rok9+jargDOyWDx6fdV1VVd+/cZDDsCXMlgPbwDqW+mfYt5XwmsnfOkZxxijvZPAOcmuWDoc2e6K/C2JCcwWDLo8KF9H62q7wAkuQ64F/BTwKVVdSPA0LJKv8JggeIXdNtHMlh+6fMjnoOkRWYIk3Qo3AH4dlWdNMu+HcCfdndsHgx8jMHdq7n6w+DOzz4BqKpLkzwK+DUGweaVVfV24Af14/XXbmX2v2vz1TfXsef6rNk8kFnCTlX9bnfX7deAK5M8eJb3vhS4pKqemGQt8PFZahmlngBPqqovjFizpMacEybpdquqaeDGJE8ByMADun3/yuAu1jYGw4O3ztd/LknuBXy9qt4EvBl40AJl/Qtwl4Xqu7264PTnwGtm2Xd8VV1eVZuBPcAxw3V17gpMda+fNcIhLwMeleS47hj7hiMvBp6b7lZfkgce8MlIWlSGMEkH4yeT7B76eR7wdOCMJFcD1wLDE8PPB57R/d5nvv6zORm4OslnGAxvblug/3nAC7tJ6scfxPHmc3z3uZ8HLgBeXVVvnaXfK5J8Lsk1wCeBqxkMy564b2I+8GfA/+7Oa8E7b1W1BziTwTcyr+bH/0xfymAo87NJru22JS1h+fFdfEmSJC0W74RJkiQ1YAiTJElqwBAmSZLUgCFMkiSpAUOYJElSA4YwSZKkBgxhkiRJDRjCJEmSGvj/J3uw7FpaAhkAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "plt.figure(figsize=(10, 6))\n", "ax = sns.histplot(data=str_df[str_df.LevDist <= 40], x=\"LevDist\", bins=100)\n", "ax.set(xlabel=\"Levenshtein Distance\", ylabel = \"Count (in millions)\")" ] }, { "cell_type": "code", "execution_count": 13, "id": "hundred-bowling", "metadata": {}, "outputs": [], "source": [ "# pd.qcut(str_df[str_df.LevDist <= 100]['LevDist'], q=100, retbins=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "quarterly-shock", "metadata": {}, "outputs": [], "source": [ "str_df.LevDist[str_df.LevDist <= 20].hist(bins=100).set_title(\"count v/s Lev edit distances till 20\")" ] }, { "cell_type": "code", "execution_count": null, "id": "entire-candle", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 label \\\n", " --filter-keys node1 label \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv" ] }, { "cell_type": "code", "execution_count": 35, "id": "similar-nevada", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16922584 ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv" ] }, { "cell_type": "markdown", "id": "administrative-barbados", "metadata": {}, "source": [ "### Dates Comparison" ] }, { "cell_type": "code", "execution_count": 63, "id": "creative-office", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-03-15 01:44:30 query]: SQL Translation:\n", "---------------------------------------------\n", " 
SELECT graph_22_c1.\"node1\", graph_22_c1.\"label\", graph_22_c1.\"node2\", graph_24_c2.\"label\" \"_aLias.newNode2Label\", graph_24_c2.\"node2\" \"_aLias.newNode2\"\n", " FROM graph_22 AS graph_22_c1, graph_24 AS graph_24_c2\n", " WHERE graph_22_c1.\"node1\"=graph_24_c2.\"node1\"\n", " AND (graph_22_c1.\"label\" = graph_24_c2.\"label\")\n", " PARAS: []\n", "---------------------------------------------\n", "[2021-03-15 01:44:30 sqlstore]: CREATE INDEX on table graph_22 column node1 ...\n", "[2021-03-15 01:44:33 sqlstore]: ANALYZE INDEX on table graph_22 column node1 ...\n", "[2021-03-15 01:44:34 sqlstore]: CREATE INDEX on table graph_24 column node1 ...\n", "[2021-03-15 01:45:08 sqlstore]: ANALYZE INDEX on table graph_24 column node1 ...\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/claims.time.tsv.gz \\\n", " --match \"newSeg: (x)-[r]->(y), time: (x)-[s]->(z)\" \\\n", " --where \"r.label = s.label\" \\\n", " --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_newSeg_date_new_vals_rightone.tsv \\\n", " --graph-cache ~/temp1.sqlite3.db\n" ] }, { "cell_type": "code", "execution_count": null, "id": "identified-calculation", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "date_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_newSeg_date_new_vals_rightone.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": null, "id": "federal-cursor", "metadata": {}, "outputs": [], "source": [ "# date_df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": null, "id": "infinite-handbook", "metadata": {}, "outputs": [], "source": [ "date_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "established-joining", "metadata": {}, 
"outputs": [], "source": [ "def parseDate(str):\n", "# try:\n", " if str == '' or str == \" \": return []\n", " elems = []\n", " toFetchI = 1\n", " dash1 = str.find(\"-\",toFetchI)\n", " toFetchI = dash1 + 1\n", " elems.append(int(str[:dash1]))\n", "\n", " dash2 = str.find(\"-\",toFetchI)\n", " toFetchI = dash2 + 1\n", " elems.append(int(str[dash1+1:dash2]))\n", "\n", " dashT = str.find(\"T\",toFetchI)\n", " toFetchI = dashT + 1\n", " elems.append(int(str[dash2+1:dashT]))\n", "\n", " dashC = str.find(\":\",toFetchI)\n", " toFetchI = dashC + 1\n", " elems.append(int(str[dashT+1:dashC]))\n", "\n", " dashC2 = str.find(\":\",toFetchI)\n", " toFetchI = dashC2 + 1\n", " elems.append(int(str[dashC+1:dashC2]))\n", "\n", " dashZ = str.find(\"Z\",toFetchI)\n", " toFetchI = dashZ + 2\n", " elems.append(int(str[dashC2+1:dashZ]))\n", "\n", " elems.append(int(str[toFetchI:]))\n", " return elems\n", "# except:\n", "# print(str)\n", "# return []\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "lucky-gossip", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "def validateDate(elems):\n", " if len(elems) == 0:\n", " return False\n", " precision = elems[-1]\n", "# assert precision >= 9\n", " elems = elems[:-1]\n", " if elems[1] == 0: elems[1] = 1\n", " if elems[2] == 0: elems[2] = 1\n", " \n", " if elems[0] < 1970 or elems[0] > 9999: \n", " if elems[0] % 400 == 0 or (elems[0] % 4 == 0 and elems[0] % 100 != 0):\n", " elems[0] = 1972\n", " else:\n", " elems[0] = 1970\n", " if precision < 0 or precision > 14:\n", " return False\n", " try:\n", " datetime.datetime(*elems)\n", " return True\n", " except:\n", " return False\n", " return status" ] }, { "cell_type": "code", "execution_count": null, "id": "executed-theater", "metadata": {}, "outputs": [], "source": [ "validateDate(parseDate(\"1887-00-00T00:00:00Z/9\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "enormous-carpet", "metadata": {}, "outputs": [], "source": [ 
"datetime.datetime(*[1948, 2, 29, 0, 0, 0, 11])" ] }, { "cell_type": "code", "execution_count": null, "id": "complete-index", "metadata": {}, "outputs": [], "source": [ "date_df['parsed_date'] = date_df['node2'].apply(lambda x: parseDate(x[1:]))\n", "date_df['parsed_date2'] = date_df['newNode2'].apply(lambda x: parseDate(x[1:]))\n", "date_df['valid_date'] = date_df['node2'].apply(lambda x: validateDate(parseDate(x[1:])))\n", "date_df['same_date'] = date_df.apply(lambda p: p.parsed_date == p.parsed_date2, axis=1)\n", "date_df['str_same_date'] = date_df.apply(lambda p: p.node2 == p.newNode2, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "surface-warehouse", "metadata": {}, "outputs": [], "source": [ "len(date_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "diagnostic-satellite", "metadata": {}, "outputs": [], "source": [ "date_df[date_df['valid_date'] == False]" ] }, { "cell_type": "code", "execution_count": null, "id": "seventh-sister", "metadata": {}, "outputs": [], "source": [ "date_df[date_df['same_date']]" ] }, { "cell_type": "code", "execution_count": null, "id": "failing-mileage", "metadata": {}, "outputs": [], "source": [ "print(f\"No. 
of deleted statements having exact same date in dataset as of 7th December 2020: {sum(date_df['str_same_date'])}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "clean-canon", "metadata": {}, "outputs": [], "source": [ "def customTimeDelta(date1,date2):\n", "    # Return date1 - date2 as a timedelta, or None when either parsed date is\n", "    # missing or cannot be represented by datetime (e.g. year outside 1..9999).\n", "    if not date1 or not date2:\n", "        return None\n", "    try:\n", "        # Drop the trailing precision element before building the datetime.\n", "        date1 = datetime.datetime(*date1[:-1])\n", "        date2 = datetime.datetime(*date2[:-1])\n", "        return date1 - date2\n", "    except (ValueError, OverflowError, TypeError):\n", "        return None" ] }, { "cell_type": "code", "execution_count": null, "id": "waiting-thumbnail", "metadata": {}, "outputs": [], "source": [ "# Copy the filtered rows so that adding a column does not write into a view of date_df.\n", "date_df1 = date_df[(date_df['valid_date'] == True) & (date_df['same_date'] == False)].copy()" ] }, { "cell_type": "code", "execution_count": null, "id": "superior-gothic", "metadata": {}, "outputs": [], "source": [ "date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "muslim-stephen", "metadata": {}, "outputs": [], "source": [ "date_df1['time_delta']" ] }, { "cell_type": "markdown", "id": "relative-tomorrow", "metadata": {}, "source": [ "### Numeric Values Comparison" ] }, { "cell_type": "code", "execution_count": null, "id": "revolutionary-mistake", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "eight-haven", "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "4323460 ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv" ] }, { "cell_type": "code", "execution_count": 2, "id": "unknown-nirvana", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-09 15:19:10 sqlstore]: IMPORT graph directly into table graph_71 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_qty.tsv ...\n", "[2021-04-09 15:19:30 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_71_c1.\"node1\", graph_71_c1.\"label\", graph_71_c1.\"node2\", graph_51_c2.\"label\" \"_aLias.node2;newLabel\", graph_51_c2.\"node2\" \"_aLias.node2;newVal\"\n", " FROM graph_51 AS graph_51_c2, graph_71 AS graph_71_c1\n", " WHERE graph_51_c2.\"node1\"=graph_71_c1.\"node1\"\n", " AND (graph_71_c1.\"label\" = graph_51_c2.\"label\")\n", " PARAS: []\n", "---------------------------------------------\n", "[2021-04-09 15:19:30 sqlstore]: CREATE INDEX on table graph_71 column node1 ...\n", "[2021-04-09 15:19:32 sqlstore]: ANALYZE INDEX on table graph_71 column node1 ...\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/claims.quantity.tsv.gz \\\n", " --match \"non: (x)-[r]->(y), quantity: (x)-[s]->(z)\" \\\n", " --where \"r.label = s.label\" \\\n", " --return 'x, r.label, y, s.label as `node2;newLabel`, z as `node2;newVal`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "convertible-softball", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3239699 ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv\r\n" ] } ], "source": [ "!wc -l 
../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv" ] }, { "cell_type": "code", "execution_count": 61, "id": "unlikely-overhead", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "historical-copying", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-09 15:26:38 sqlstore]: IMPORT graph directly into table graph_72 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv ...\n", "[2021-04-09 15:29:43 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", graph_72_c1.\"node2;newLabel\" \"_aLias.node2;newLabel\", max(graph_72_c1.\"node2;newVal\") \"_aLias.node2;newValue\", count(graph_72_c1.\"node2;newVal\") \"_aLias.node2;branching\"\n", " FROM graph_72 AS graph_72_c1\n", " WHERE graph_72_c1.\"node2;newLabel\"=graph_72_c1.\"node2;newLabel\"\n", " AND graph_72_c1.\"node2;newVal\"=graph_72_c1.\"node2;newVal\"\n", " GROUP BY graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", \"_aLias.node2;newLabel\"\n", " PARAS: []\n", "---------------------------------------------\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv \\\n", " --match \"(node1)-[r]->(node2{newLabel: newLabel, newVal: newValue})\" \\\n", " --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv" ] }, { "cell_type": "code", "execution_count": 4, 
"id": "waiting-citizenship", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 5, "id": "unlike-huntington", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
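<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
 " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>node1</th>\n", " <th>label</th>\n", " <th>node2</th>\n", " <th>node2;newLabel</th>\n", " <th>node2;newValue</th>\n", " <th>node2;branching</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n",
 " <tr>\n", " <th>2501639</th>\n", " <td>Q999961</td>\n", " <td>P1082</td>\n", " <td>+17243[+17243,+17243]</td>\n", " <td>P1082</td>\n", " <td>+8883</td>\n", " <td>27</td>\n", " </tr>\n",
 " <tr>\n", " <th>2501640</th>\n", " <td>Q999961</td>\n", " <td>P1082</td>\n", " <td>+6925</td>\n", " <td>P1082</td>\n", " <td>+8883</td>\n", " <td>27</td>\n", " </tr>\n",
 " <tr>\n", " <th>2501641</th>\n", " <td>Q999961</td>\n", " <td>P1082</td>\n", " <td>+8653[+8653,+8653]</td>\n", " <td>P1082</td>\n", " <td>+8883</td>\n", " <td>27</td>\n", " </tr>\n",
 " <tr>\n", " <th>2501642</th>\n", " <td>Q999961</td>\n", " <td>P2046</td>\n", " <td>+23.95Q712226</td>\n", " <td>P2046</td>\n", " <td>+23.952616Q712226</td>\n", " <td>1</td>\n", " </tr>\n",
 " <tr>\n", " <th>2501643</th>\n", " <td>Q999988</td>\n", " <td>P2046</td>\n", " <td>+1000[+1000,+1000]Q81292</td>\n", " <td>P2046</td>\n", " <td>+1000Q81292</td>\n", " <td>1</td>\n", " </tr>\n",
 " </tbody>\n", "</table>\n", "</div>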
" ], "text/plain": [ " node1 label node2 node2;newLabel \\\n", "2501639 Q999961 P1082 +17243[+17243,+17243] P1082 \n", "2501640 Q999961 P1082 +6925 P1082 \n", "2501641 Q999961 P1082 +8653[+8653,+8653] P1082 \n", "2501642 Q999961 P2046 +23.95Q712226 P2046 \n", "2501643 Q999988 P2046 +1000[+1000,+1000]Q81292 P2046 \n", "\n", " node2;newValue node2;branching \n", "2501639 +8883 27 \n", "2501640 +8883 27 \n", "2501641 +8883 27 \n", "2501642 +23.952616Q712226 1 \n", "2501643 +1000Q81292 1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.tail()" ] }, { "cell_type": "code", "execution_count": 6, "id": "confident-carolina", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n", "P1733\tP4876\t+1014280\tP4876\t+28977\t1\r\n", "P2040\tP4876\t+34596\tP4876\t+38623\t1\r\n", "P2349\tP4876\t+12367\tP4876\t+12500\t3\r\n", "P2427\tP4876\t+95000\tP4876\t+96793\t4\r\n", "P2518\tP4876\t+11126\tP4876\t+11145\t1\r\n", "P2725\tP4876\t+2232\tP4876\t+3907\t1\r\n", "P2816\tP4876\t+32155\tP4876\t+34149\t2\r\n", "P3289\tP4876\t+113576\tP4876\t+123199\t1\r\n", "P3400\tP4876\t+123817\tP4876\t+123817\t4\r\n" ] } ], "source": [ "!head ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "adjusted-discretion", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7\n" ] } ], "source": [ "import re\n", "test_str = \"+123817Q\"\n", "temp = re.search(r'[a-z]', test_str, re.I)\n", "if temp is not None:\n", " print(temp.start())\n", "else:\n", " print(\"Not found\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "toxic-heart", "metadata": {}, "outputs": [], "source": [ "def splitIntoParts(text):\n", " temp = re.search(r'[a-z]', text, re.I)\n", " firstAlpha1 = -1 if temp is None else temp.start()\n", " alpha1 = \"\" if 
firstAlpha1 == -1 else text[firstAlpha1:]\n", " text = text if firstAlpha1 == -1 else text[:firstAlpha1]\n", " \n", " temp = re.search(r'\\[', text, re.I)\n", " firstBracket1 = -1 if temp is None else temp.start()\n", " brack1 = \"\" if firstBracket1 == -1 else text[firstBracket1:]\n", " \n", " num1 = text if firstBracket1 == -1 else text[:firstBracket1]\n", " \n", " return num1, brack1, alpha1" ] }, { "cell_type": "code", "execution_count": 9, "id": "impressed-monthly", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('+1234', '[+1, -1]', 'Q12345')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitIntoParts(\"+1234[+1, -1]Q12345\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "sunset-fraction", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c86b1765daec4bc084f0c0f399a69dfd", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2501645 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from dateutil.parser import parse\n", "import re\n", "import rltk\n", "from rltk.similarity import levenshtein_distance as ld\n", "from nltk.tokenize import word_tokenize as wt\n", "from tqdm.notebook import tqdm\n", "\n", "def is_num(string):\n", "    try:\n", "        float(string)\n", "        return True\n", "\n", "    except ValueError:\n", "        return False\n", "\n", "f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",\"r\").read().split(\"\\n\")\n", "fNum = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",\"w\")\n", "firstLine = f1[0]\n", "\n", "# Append one boolean column per compared part: number, range, number-and-range, unit.\n", "fNum.write(firstLine+\"\\tNumNE\\tRangeNE\\tNumNRangeNE\\tUnitNE\\n\")\n", "# fnonQnd.write(f1[0]+\"\\n\")\n", "\n", "for i in tqdm(range(1,len(f1))):\n", "    line = f1[i]\n", "    # Skip empty lines, such as the one left after the file's final newline;\n", "    # indexing their fields raises IndexError.\n", "    if not line:\n", "        continue\n", "    fields = line.split(\"\\t\")\n", "    val1 = fields[2]  # removed value (node2)\n", "    val2 = fields[4]  # current value (node2;newValue)\n", "\n", "    num1, brack1, alpha1 = splitIntoParts(val1)\n", "    num2, brack2, alpha2 = splitIntoParts(val2)\n", "\n", "#     print(val1, num1, brack1, alpha1)\n", "#     print(val2, num2, brack2, alpha2)\n", "\n", "    fNum.write(line + \"\\t\" + str(num1 != num2) + \"\\t\" + str(brack1 != brack2) + \"\\t\" + str((num1 != num2) and (brack1 != brack2)) + \"\\t\" + str(alpha1 != alpha2) + \"\\n\")\n", "\n", "fNum.close()" ] }, { "cell_type": "code", "execution_count": 11, "id": "impaired-venue", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "num_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 12, "id": "strange-alcohol", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
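<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
 " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>node1</th>\n", " <th>label</th>\n", " <th>node2</th>\n", " <th>node2;newLabel</th>\n", " <th>node2;newValue</th>\n", " <th>node2;branching</th>\n", " <th>NumNE</th>\n", " <th>RangeNE</th>\n", " <th>NumNRangeNE</th>\n", " <th>UnitNE</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n",
 " <tr>\n", " <th>0</th>\n", " <td>P1733</td>\n", " <td>P4876</td>\n", " <td>+1014280</td>\n", " <td>P4876</td>\n", " <td>+28977</td>\n", " <td>1</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n",
 " <tr>\n", " <th>1</th>\n", " <td>P2040</td>\n", " <td>P4876</td>\n", " <td>+34596</td>\n", " <td>P4876</td>\n", " <td>+38623</td>\n", " <td>1</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n",
 " <tr>\n", " <th>2</th>\n", " <td>P2349</td>\n", " <td>P4876</td>\n", " <td>+12367</td>\n", " <td>P4876</td>\n", " <td>+12500</td>\n", " <td>3</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n",
 " <tr>\n", " <th>3</th>\n", " <td>P2427</td>\n", " <td>P4876</td>\n", " <td>+95000</td>\n", " <td>P4876</td>\n", " <td>+96793</td>\n", " <td>4</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n",
 " <tr>\n", " <th>4</th>\n", " <td>P2518</td>\n", " <td>P4876</td>\n", " <td>+11126</td>\n", " <td>P4876</td>\n", " <td>+11145</td>\n", " <td>1</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " </tr>\n",
 " </tbody>\n", "</table>\n", "</div>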
" ], "text/plain": [ " node1 label node2 node2;newLabel node2;newValue node2;branching \\\n", "0 P1733 P4876 +1014280 P4876 +28977 1 \n", "1 P2040 P4876 +34596 P4876 +38623 1 \n", "2 P2349 P4876 +12367 P4876 +12500 3 \n", "3 P2427 P4876 +95000 P4876 +96793 4 \n", "4 P2518 P4876 +11126 P4876 +11145 1 \n", "\n", " NumNE RangeNE NumNRangeNE UnitNE \n", "0 True False False False \n", "1 True False False False \n", "2 True False False False \n", "3 True False False False \n", "4 True False False False " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "id": "hindu-merit", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv" ] }, { "cell_type": "code", "execution_count": 14, "id": "hollywood-boring", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2.501575e+06\n", "mean 6.733284e+01\n", "std 5.003042e+02\n", "min 1.000000e+00\n", "25% 1.000000e+00\n", "50% 2.000000e+00\n", "75% 1.100000e+01\n", "max 2.132100e+04\n", "Name: node2;branching, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_df['node2;branching'].describe()" ] }, { "cell_type": "code", "execution_count": 15, "id": "moral-history", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 2501575 quantities, there are 1496454 cases where numbers have got updated, 2037283 cases where ranges have got updated, 1069289 cases where number and range both have got updated, 78048 cases were the unit has got updated\n" ] } ], "source": [ "print(f\"Out of {len(num_df)} quantities, there are {num_df['NumNE'].sum()} cases where numbers have got updated, 
{num_df['RangeNE'].sum()} cases where ranges have got updated, {num_df['NumNRangeNE'].sum()} cases where number and range both have got updated, {num_df['UnitNE'].sum()} cases where the unit has got updated\")" ] }, { "cell_type": "markdown", "id": "muslim-dryer", "metadata": {}, "source": [ "### Qnodes comparison" ] }, { "cell_type": "markdown", "id": "brilliant-picnic", "metadata": {}, "source": [ "#### Qnodes type segregation\n", "\n", "Here, for each Qnode-to-Qnode removed statement, we count:\n", "* How many statements have a node1 that is an instance (P31) of something, a subclass (P279) of something, or both\n", "* How many statements have a node2 that is an instance (P31) of something, a subclass (P279) of something, or both" ] }, { "cell_type": "code", "execution_count": null, "id": "described-america", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "universal-surprise", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv" ] }, { "cell_type": "code", "execution_count": 60, "id": "elder-tissue", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o 
../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "killing-emphasis", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "answering-sheriff", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv" ] }, { "cell_type": "code", "execution_count": 61, "id": "intimate-sullivan", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv" ] }, { "cell_type": "code", "execution_count": 57, "id": "surprising-clone", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15682364 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv" ] }, { "cell_type": "code", "execution_count": 62, "id": "innovative-thread", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3500869 
../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv\n", " 3396316 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv\n", " 14206459 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv\n", " 21103644 total\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1*" ] }, { "cell_type": "code", "execution_count": 63, "id": "accompanied-lighting", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 10064419 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv\n", " 6622159 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv\n", " 12057758 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv\n", " 28744336 total\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2*" ] }, { "cell_type": "markdown", "id": "verified-vegetable", "metadata": {}, "source": [ "#### Qnodes to Qnodes (instance/subclass analysis)\n", "\n", "Here, we analyze how many P31 relations were deleted, how many were updated to P31/P279/nothing. 
We do the same thing for P279 relations that were deleted" ] }, { "cell_type": "code", "execution_count": null, "id": "quick-welsh", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --return 'count(a)' \\\n", " --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "satisfied-philosophy", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "southern-daisy", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P279]->(b)' \\\n", " --return 'count(a)' \\\n", " --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P279.tsv" ] }, { "cell_type": "code", "execution_count": 1, "id": "subtle-tract", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P279]->(b)' \\\n", " --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv" ] }, { "cell_type": "markdown", "id": "opponent-bible", "metadata": {}, "source": [ "##### Analyze for P31 relations" ] }, { "cell_type": "code", "execution_count": 4, "id": "soviet-liverpool", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv" ] }, { "cell_type": "code", "execution_count": 5, "id": "imposed-pound", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv" ] }, { "cell_type": "code", "execution_count": 16, "id": "provincial-limit", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv" ] }, { "cell_type": "code", "execution_count": 6, "id": "dynamic-persian", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n", " ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "material-routine", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 18, "id": "aboriginal-injection", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3611396 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n", "2864334 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n", "150123 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n", "106540 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n", "703480 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "perceived-hopkins", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n", " --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv" ] }, 
{ "cell_type": "code", "execution_count": 1, "id": "antique-neighborhood", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv" ] }, { "cell_type": "code", "execution_count": 2, "id": "alleged-destiny", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 626925 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv\r\n", " 76556 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv\r\n", " 703481 total\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_*" ] }, { "cell_type": "markdown", "id": "opposed-palmer", "metadata": {}, "source": [ "##### Analyze for P279 relations" ] }, { "cell_type": "code", "execution_count": 8, "id": "hybrid-hacker", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv" ] }, { "cell_type": "code", "execution_count": 9, "id": "reliable-ontario", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv" ] }, { "cell_type": "code", "execution_count": 17, "id": "radio-bumper", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv" ] }, { "cell_type": "code", "execution_count": 10, "id": "loving-switzerland", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n", " ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv" ] }, { "cell_type": "code", "execution_count": 11, "id": "prostate-trace", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 19, "id": "subsequent-recovery", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "935667 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n", "865917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n", "454917 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n", "421734 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n", "36568 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "hazardous-liberal", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "^C\r\n", "\r\n", "Keyboard interrupt in ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz --filter-mode NONE --input-keys node1 --filter-keys node1 -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv.\r\n" ] } ], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n", " --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "manual-embassy", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv" ] }, { "cell_type": "code", "execution_count": 4, "id": "determined-wonder", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 35004 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv\r\n", " 1565 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv\r\n", " 36569 total\r\n" ] } ], "source": [ "# Each count includes the file's header line, so the number of statements is one less per file.\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_*" ] }, { "cell_type": "markdown", "id": "dramatic-spyware", "metadata": {}, "source": [ "Fin." ] }, { "cell_type": "code", "execution_count": null, "id": "general-hometown", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kgtkEnv", "language": "python", "name": "kgtkenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "288px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": 
"print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }