{ "cells": [ { "cell_type": "markdown", "id": "statutory-onion", "metadata": {}, "source": [ "# Understanding Removed Statements Dataset\n", "\n", "Source of data: [GDrive | Removed Statements of Wikidata | Feb 1 2021](https://drive.google.com/file/d/1TQP1rADdvhDjsvBpLzSE9Bx3n73wf-Md/view?usp=sharing)\n", "\n", "Steps performed:\n", "* Divide the dataset into two parts: redirected and non-redirected. The redirected part contains statements in which node1 or node2 is redirected; the non-redirected part contains statements in which neither node1 nor node2 is redirected.\n", "\n", "\n", "**Summary**\n", "\n", "The Removed Statements dataset has 76.5M removed statements. Out of these, " ] }, { "cell_type": "markdown", "id": "christian-mounting", "metadata": {}, "source": [ "## Determining redirects and dividing the dataset into two parts\n", "\n", "* Since a redirects dataset was not available, a SPARQL query was run on Feb 19, 2021 to determine all the redirects existing at that time. It was executed using the [SPARQL query](https://query.wikidata.org/) service. The query was:\n", "    ```\n", "    SELECT ?old_node\n", "    WHERE {\n", "        ?old_node owl:sameAs ?new_node.\n", "    }\n", "    ```\n", "* The result also contains a few lexemes, which we don't need. To identify them, I then ran the query:\n", "    ```\n", "    SELECT ?old_node\n", "    WHERE {\n", "        ?old_node owl:sameAs ?new_node.\n", "        ?new_node rdf:type ontolex:LexicalEntry.\n", "    }\n", "    ```\n", "* After removing the lexemes from the nodes file, a final file of redirected non-lexeme nodes was created with data from Feb 19, 2021: `data/SPARQL_redirects_non-lexemes.tsv`.\n", "* Using this reduced dataset, I determined which statements in the removed_statements.tsv dataset involve redirected nodes - `../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`. 
This file contains the removed statements in which either node1 or node2 is redirected.\n", "* After this, I extract the removed statements not present in this subset, i.e. all removed statements in which neither node1 nor node2 is redirected - `../opAnalysis/removed_statements_both_nonredirects.tsv`\n", "\n", "For this, I use the following set of commands:" ] }, { "cell_type": "code", "execution_count": 2, "id": "thick-absorption", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "id": "boolean-string", "metadata": {}, "outputs": [], "source": [ "# On the basis of SPARQL\n", "# Statements whose node1 is a redirected node\n", "!kgtk ifexists -i ../../data/removed_statements.tsv \\\n", "    --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", "    --filter-mode NONE \\\n", "    --input-keys node1 \\\n", "    --filter-keys id \\\n", "    -o ../../opAnalysis/removed_statements_redirects_basis_node1.tsv\n", "# Statements whose node1 is not a redirected node\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", "    --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", "    --filter-mode NONE \\\n", "    --input-keys node1 \\\n", "    --filter-keys id \\\n", "    -o ../../opAnalysis/removed_statements_nonredirects_basis_node1.tsv\n", "# Statements whose node2 is a redirected node\n", "!kgtk ifexists -i ../../data/removed_statements.tsv \\\n", "    --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", "    --filter-mode NONE \\\n", "    --input-keys node2 \\\n", "    --filter-keys id \\\n", "    -o ../../opAnalysis/removed_statements_redirects_basis_node2.tsv\n", "# Statements whose node2 is not a redirected node\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", "    --filter-on ../../data/SPARQL_redirects_non-lexemes.tsv \\\n", "    --filter-mode NONE \\\n", "    --input-keys node2 \\\n", "    --filter-keys id \\\n", "    -o ../../opAnalysis/removed_statements_nonredirects_basis_node2.tsv\n", "# Rows of the node1 subset that are not already in the node2 subset\n", "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_redirects_basis_node1.tsv \\\n", "    --filter-on ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n", "    -o 
../../opAnalysis/temp1.tsv\n", "!kgtk cat -i ../../opAnalysis/temp1.tsv \\\n", "    ../../opAnalysis/removed_statements_redirects_basis_node2.tsv \\\n", "    -o ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv\n", "!kgtk ifnotexists -i ../../data/removed_statements.tsv \\\n", "    --filter-on ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", "    -o ../../opAnalysis/removed_statements_both_nonredirects.tsv" ] }, { "cell_type": "markdown", "id": "committed-volunteer", "metadata": {}, "source": [ "## P31 edges distribution" ] }, { "cell_type": "markdown", "id": "objective-range", "metadata": {}, "source": [ "Now, for the redirected dataset `../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv`, we'll determine how many of the removed statements are P31 (instance of) edges and compute the distribution of their node2 values." ] }, { "cell_type": "markdown", "id": "final-fraud", "metadata": {}, "source": [ "### For Redirected Removed Statements" ] }, { "cell_type": "code", "execution_count": null, "id": "analyzed-silicon", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", "    --match 'o: (a)-[:P31]->(b)' \\\n", "    --return 'b, count(distinct a)' \\\n", "    -o ../../opAnalysis/removed_statements_redirects_P31_stats1.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "smaller-eugene", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>count</th><th>perc</th></tr>\n", 
"    <tr><th>parent</th><th></th><th></th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>Q4167836</th><td>526207</td><td>0.213808</td></tr>\n", 
"    <tr><th>Q17329259</th><td>301359</td><td>0.122448</td></tr>\n", 
"    <tr><th>Q5</th><td>222809</td><td>0.090531</td></tr>\n", 
"    <tr><th>Q4167410</th><td>108583</td><td>0.044119</td></tr>\n", 
"    <tr><th>Q13442814</th><td>101156</td><td>0.041102</td></tr>\n", 
"    <tr><th>Q7187</th><td>88231</td><td>0.035850</td></tr>\n", 
"    <tr><th>Q11266439</th><td>61007</td><td>0.024788</td></tr>\n", 
"    <tr><th>Q4423781</th><td>53671</td><td>0.021808</td></tr>\n", 
"    <tr><th>Q17143521</th><td>51581</td><td>0.020958</td></tr>\n", 
"    <tr><th>Q15917122</th><td>50642</td><td>0.020577</td></tr>\n", 
"    <tr><th>Q486972</th><td>49257</td><td>0.020014</td></tr>\n", 
"    <tr><th>Q16521</th><td>46522</td><td>0.018903</td></tr>\n", 
"    <tr><th>Q318</th><td>26722</td><td>0.010858</td></tr>\n", 
"    <tr><th>Q532</th><td>23721</td><td>0.009638</td></tr>\n", 
"    <tr><th>Q20900710</th><td>23482</td><td>0.009541</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " count perc\n", "parent \n", "Q4167836 526207 0.213808\n", "Q17329259 301359 0.122448\n", "Q5 222809 0.090531\n", "Q4167410 108583 0.044119\n", "Q13442814 101156 0.041102\n", "Q7187 88231 0.035850\n", "Q11266439 61007 0.024788\n", "Q4423781 53671 0.021808\n", "Q17143521 51581 0.020958\n", "Q15917122 50642 0.020577\n", "Q486972 49257 0.020014\n", "Q16521 46522 0.018903\n", "Q318 26722 0.010858\n", "Q532 23721 0.009638\n", "Q20900710 23482 0.009541" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_P31_stats1.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "suburban-cosmetic", "metadata": {}, "source": [ "### For non-redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "characteristic-still", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --return 'b, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv" ] }, { "cell_type": "code", "execution_count": 9, "id": "subsequent-dutch", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>count</th><th>perc</th></tr>\n", 
"    <tr><th>parent</th><th></th><th></th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>Q4167836</th><td>368888</td><td>0.102453</td></tr>\n", 
"    <tr><th>Q4167410</th><td>132403</td><td>0.036773</td></tr>\n", 
"    <tr><th>Q5</th><td>130252</td><td>0.036176</td></tr>\n", 
"    <tr><th>Q571</th><td>126883</td><td>0.035240</td></tr>\n", 
"    <tr><th>Q11266439</th><td>125824</td><td>0.034946</td></tr>\n", 
"    <tr><th>Q838948</th><td>119928</td><td>0.033308</td></tr>\n", 
"    <tr><th>Q486972</th><td>108105</td><td>0.030025</td></tr>\n", 
"    <tr><th>Q532</th><td>106786</td><td>0.029658</td></tr>\n", 
"    <tr><th>Q783794</th><td>101121</td><td>0.028085</td></tr>\n", 
"    <tr><th>Q1539532</th><td>78186</td><td>0.021715</td></tr>\n", 
"    <tr><th>Q916333</th><td>62789</td><td>0.017439</td></tr>\n", 
"    <tr><th>Q16521</th><td>53402</td><td>0.014832</td></tr>\n", 
"    <tr><th>Q7366</th><td>45005</td><td>0.012499</td></tr>\n", 
"    <tr><th>Q13406463</th><td>42582</td><td>0.011827</td></tr>\n", 
"    <tr><th>Q18593264</th><td>40505</td><td>0.011250</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " count perc\n", "parent \n", "Q4167836 368888 0.102453\n", "Q4167410 132403 0.036773\n", "Q5 130252 0.036176\n", "Q571 126883 0.035240\n", "Q11266439 125824 0.034946\n", "Q838948 119928 0.033308\n", "Q486972 108105 0.030025\n", "Q532 106786 0.029658\n", "Q783794 101121 0.028085\n", "Q1539532 78186 0.021715\n", "Q916333 62789 0.017439\n", "Q16521 53402 0.014832\n", "Q7366 45005 0.012499\n", "Q13406463 42582 0.011827\n", "Q18593264 40505 0.011250" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_P31_stats1.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "whole-influence", "metadata": {}, "source": [ "## Properties Distribution" ] }, { "cell_type": "markdown", "id": "international-conditioning", "metadata": {}, "source": [ "### For redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "known-moore", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_redirects_basis_node1or2.tsv \\\n", " --match 'o: (a)-[r]->(b)' \\\n", " --return 'r.label, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_redirects_props_dist.tsv" ] }, { "cell_type": "code", "execution_count": 6, "id": "unlikely-default", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>count</th><th>perc</th></tr>\n", 
"    <tr><th>parent</th><th></th><th></th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>P31</th><td>2381072</td><td>0.234921</td></tr>\n", 
"    <tr><th>P17</th><td>357286</td><td>0.035251</td></tr>\n", 
"    <tr><th>P1433</th><td>299464</td><td>0.029546</td></tr>\n", 
"    <tr><th>P735</th><td>295778</td><td>0.029182</td></tr>\n", 
"    <tr><th>P50</th><td>268412</td><td>0.026482</td></tr>\n", 
"    <tr><th>P2860</th><td>243607</td><td>0.024035</td></tr>\n", 
"    <tr><th>P625</th><td>227779</td><td>0.022473</td></tr>\n", 
"    <tr><th>P106</th><td>185184</td><td>0.018271</td></tr>\n", 
"    <tr><th>P131</th><td>183759</td><td>0.018130</td></tr>\n", 
"    <tr><th>P21</th><td>179069</td><td>0.017667</td></tr>\n", 
"    <tr><th>P921</th><td>167723</td><td>0.016548</td></tr>\n", 
"    <tr><th>P279</th><td>162394</td><td>0.016022</td></tr>\n", 
"    <tr><th>P1566</th><td>160213</td><td>0.015807</td></tr>\n", 
"    <tr><th>P684</th><td>152695</td><td>0.015065</td></tr>\n", 
"    <tr><th>P703</th><td>119182</td><td>0.011759</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " count perc\n", "parent \n", "P31 2381072 0.234921\n", "P17 357286 0.035251\n", "P1433 299464 0.029546\n", "P735 295778 0.029182\n", "P50 268412 0.026482\n", "P2860 243607 0.024035\n", "P625 227779 0.022473\n", "P106 185184 0.018271\n", "P131 183759 0.018130\n", "P21 179069 0.017667\n", "P921 167723 0.016548\n", "P279 162394 0.016022\n", "P1566 160213 0.015807\n", "P684 152695 0.015065\n", "P703 119182 0.011759" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_redirects_props_dist.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "satisfactory-future", "metadata": {}, "source": [ "### For non-redirected removed statements" ] }, { "cell_type": "code", "execution_count": null, "id": "seasonal-composite", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " --match 'o: (a)-[r]->(b)' \\\n", " --return 'r.label, count(distinct a)' \\\n", " -o ../../opAnalysis/removed_statements_nonredirects_props_dist.tsv" ] }, { "cell_type": "code", "execution_count": 11, "id": "straight-haiti", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", 
"<table border=\"1\" class=\"dataframe\">\n", 
"  <thead>\n", 
"    <tr style=\"text-align: right;\"><th></th><th>count</th><th>perc</th></tr>\n", 
"    <tr><th>parent</th><th></th><th></th></tr>\n", 
"  </thead>\n", 
"  <tbody>\n", 
"    <tr><th>P2093</th><td>6173393</td><td>0.161314</td></tr>\n", 
"    <tr><th>P1476</th><td>4238487</td><td>0.110754</td></tr>\n", 
"    <tr><th>P31</th><td>3327644</td><td>0.086953</td></tr>\n", 
"    <tr><th>P569</th><td>2011539</td><td>0.052563</td></tr>\n", 
"    <tr><th>P625</th><td>1494410</td><td>0.039050</td></tr>\n", 
"    <tr><th>P577</th><td>1116328</td><td>0.029170</td></tr>\n", 
"    <tr><th>P234</th><td>999522</td><td>0.026118</td></tr>\n", 
"    <tr><th>P570</th><td>983201</td><td>0.025692</td></tr>\n", 
"    <tr><th>P131</th><td>927413</td><td>0.024234</td></tr>\n", 
"    <tr><th>P364</th><td>870224</td><td>0.022739</td></tr>\n", 
"    <tr><th>P2044</th><td>780870</td><td>0.020405</td></tr>\n", 
"    <tr><th>P279</th><td>765112</td><td>0.019993</td></tr>\n", 
"    <tr><th>P969</th><td>732461</td><td>0.019140</td></tr>\n", 
"    <tr><th>P356</th><td>413439</td><td>0.010803</td></tr>\n", 
"    <tr><th>P637</th><td>387091</td><td>0.010115</td></tr>\n", 
"  </tbody>\n", 
"</table>\n", 
"</div>
" ], "text/plain": [ " count perc\n", "parent \n", "P2093 6173393 0.161314\n", "P1476 4238487 0.110754\n", "P31 3327644 0.086953\n", "P569 2011539 0.052563\n", "P625 1494410 0.039050\n", "P577 1116328 0.029170\n", "P234 999522 0.026118\n", "P570 983201 0.025692\n", "P131 927413 0.024234\n", "P364 870224 0.022739\n", "P2044 780870 0.020405\n", "P279 765112 0.019993\n", "P969 732461 0.019140\n", "P356 413439 0.010803\n", "P637 387091 0.010115" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.read_csv('../../opAnalysis/removed_statements_nonredirects_props_dist.tsv',sep='\\t')\n", "df1.columns = ['parent','count']\n", "df1 = df1.sort_values(by=['count'],ascending=False)\n", "df1 = df1.set_index('parent')\n", "tot = df1['count'].sum()\n", "df1['perc'] = df1['count'] / tot\n", "df1.head(15)" ] }, { "cell_type": "markdown", "id": "martial-friday", "metadata": {}, "source": [ "# Comparison Removed NR dataset with Qnodes, literals\n", "\n", "First, let's split this dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "higher-photograph", "metadata": {}, "outputs": [], "source": [ "from dateutil.parser import parse\n", "import re\n", "import rltk\n", "from rltk.similarity import levenshtein_distance as ld\n", "from nltk.tokenize import word_tokenize as wt\n", "\n", "def is_num(string):\n", " try: \n", " float(string)\n", " return True\n", "\n", " except ValueError:\n", " return False\n", "\n", "f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects.tsv\",\"r\").read().split(\"\\n\")\n", "fStr = open(\"../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv\",\"w\")\n", "fDat = open(\"../../opAnalysis/removed_statements_both_nonredirects_node2_date.tsv\",\"w\")\n", "fQnd = open(\"../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv\",\"w\")\n", "fNum = open(\"../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv\",\"w\")\n", "fnonQnd = 
open(\"../../opAnalysis/removed_statements_both_nonredirects_node2_lit.tsv\",\"w\")\n", "\n", "# Every output file gets the same header row as the input\n", "fStr.write(f1[0]+\"\\n\")\n", "fDat.write(f1[0]+\"\\n\")\n", "fQnd.write(f1[0]+\"\\n\")\n", "fNum.write(f1[0]+\"\\n\")\n", "fnonQnd.write(f1[0]+\"\\n\")\n", "\n", "for i in range(1,len(f1)):\n", "    if not f1[i]:  # skip blank lines, e.g. the empty entry left by a trailing newline\n", "        continue\n", "    val1 = f1[i].split(\"\\t\")[3]\n", "    if val1.startswith('Q'):\n", "        fQnd.write(f1[i]+\"\\n\")\n", "#     elif bool(re.search(\"\\^\\d{11}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z\\/\\d{,2}\",val1)):\n", "    elif val1.startswith(\"^\"):  # written to the date file\n", "        fDat.write(f1[i]+\"\\n\")\n", "        fnonQnd.write(f1[i]+\"\\n\")\n", "    elif is_num(val1):\n", "        fNum.write(f1[i]+\"\\n\")\n", "        fnonQnd.write(f1[i]+\"\\n\")\n", "    else:\n", "        fStr.write(f1[i]+\"\\n\")\n", "        fnonQnd.write(f1[i]+\"\\n\")\n", "\n", "fQnd.close()\n", "fDat.close()\n", "fNum.close()\n", "fStr.close()\n", "fnonQnd.close()" ] }, { "cell_type": "markdown", "id": "rough-emerald", "metadata": {}, "source": [ "### String Comparison" ] }, { "cell_type": "code", "execution_count": null, "id": "amateur-effort", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n", "    ../gdrive-kgtk-dump-2020-12-07/claims.string.tsv.gz \\\n", "    --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n", "    --where \"r.label = s.label\" \\\n", "    --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n", "    -o ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n", "    --graph-cache ~/temp2.sqlite3.db" ] }, { "cell_type": "code", "execution_count": null, "id": "separate-georgia", "metadata": {}, "outputs": [], "source": [ "!sed -i '1s/.*/node1\\tlabel\\tnode2\\tnode2;newLabl\\tnode2;nw/' removed_statements_both_nonredirects_str_new_vals.tsv" ] }, { "cell_type": "markdown", "id": "disturbed-geology", "metadata": {}, "source": [ "The strings subset has a branching factor of approximately 10, i.e. 
1 removed statement with a string literal has been replaced by around 10 new statements (with the same node1-label combination). Repeating the same comparisons on every pair won't give us much insight. Instead, let's truncate this dataset, retaining just the branching-factor count for each of these node1-label combinations. " ] }, { "cell_type": "code", "execution_count": 1, "id": "fancy-photographer", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14091663 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "downtown-alabama", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n", "    --match \"(node1)-[r]->(node2{newLabl: newLabel, nw: newValue})\" \\\n", "    --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n", "    -o ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv \\\n", "    --graph-cache ~/sqlite3_caches/temptrunc.sqlite3.db" ] }, { "cell_type": "markdown", "id": "tropical-cooperation", "metadata": {}, "source": [ "On this truncated dataset, we will next compute the statistics and comparisons. Note: our original string-literal subset of removed statements was around 9 GB. After the join with claims, it grew to 90 GB. 
We have now truncated this dataset to 778 MB." ] }, { "cell_type": "code", "execution_count": null, "id": "successful-singer", "metadata": {}, "outputs": [], "source": [ "from dateutil.parser import parse\n", "import re\n", "import rltk\n", "from rltk.similarity import levenshtein_distance as ld\n", "from nltk.tokenize import word_tokenize as wt\n", "from tqdm.notebook import tqdm\n", "\n", "f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_truncated.tsv\",\"r\")\n", "fStr = open(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\",\"w\")\n", "\n", "firstLine = next(f1).rstrip()\n", "\n", "fStr.write(firstLine+\"\\tVersionBool\\tRangeBool\\tLevDist\\tRearranged\\tRearrangedFirstNP\\n\")\n", "\n", "for line in tqdm(f1):\n", "    line = line.rstrip(\"\\n\")  # drop the newline so the new columns stay on the same row\n", "    cols = line.split(\"\\t\")\n", "    val1 = cols[2]  # removed node2 value\n", "    val2 = cols[4]  # new node2 value\n", "    val2 = val2[1:-1]  # drop the first and last characters of the new value\n", "    versionBool = bool(re.fullmatch(r\"[\\d\\.]+[\\w\\s\\d]*\",val1))\n", "    rangeBool = bool(re.fullmatch(r\"[\\d]+[-|–][\\d]+\",val1))\n", "    LevDist = ld(val1,val2)\n", "    rearranged = set(wt(val1)) == set(wt(val2))\n", "    rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n", "    fStr.write(line+ \"\\t\" + str(versionBool) + \"\\t\" + str(rangeBool) + \"\\t\" + \\\n", "        str(LevDist) + \"\\t\" + str(rearranged) + \"\\t\" + str(rearrangedFirstNP) + \"\\n\")\n", "\n", "f1.close()\n", "fStr.close()" ] }, { "cell_type": "code", "execution_count": 10, "id": "international-violation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "list index out of range\r\n" ] } ], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n", "    --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv \\\n", "    --filter-mode NONE \\\n", "    --input-keys label node1 \\\n", "    --filter-keys label node1 \\\n", "    -o ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv" ] }, { "cell_type": 
"code", "execution_count": 17, "id": "tracked-carroll", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1923347844 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv" ] }, { "cell_type": "code", "execution_count": 13, "id": "vocational-pound", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14091661 ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv" ] }, { "cell_type": "code", "execution_count": 11, "id": "trained-tuning", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16922586 ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_string_unmatched.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "daily-complexity", "metadata": {}, "outputs": [], "source": [ "str_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_str_new_vals_measured.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": null, "id": "otherwise-bones", "metadata": {}, "outputs": [], "source": [ "str_df.head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "restricted-locking", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "136.48837958054622" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df['node2;branching'].mean()" ] }, { "cell_type": "code", "execution_count": 6, "id": "hundred-entrepreneur", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 2084163\n", "2 1739757\n", "3 1645943\n", "4 1530528\n", "5 1209068\n", " ... 
\n", "12813 2\n", "12840 1\n", "13554 1\n", "18192 1\n", "29360 1\n", "Name: node2;branching, Length: 2191, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df['node2;branching'].value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": 7, "id": "secret-contest", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14091660" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df['node2;branching'].value_counts().sum()" ] }, { "cell_type": "code", "execution_count": 36, "id": "editorial-romance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14091660 updates, 25167 correspond to changes due to version change with average branching factor: 5.597170898398697\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['VersionBool'].sum()} correspond to changes due to version change with average branching factor: {str_df[str_df['VersionBool'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 29, "id": "social-plenty", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 25167.000000\n", "mean 4.792625\n", "std 6.162759\n", "min 0.000000\n", "25% 1.000000\n", "50% 2.000000\n", "75% 5.000000\n", "max 63.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['VersionBool'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 37, "id": "promising-hopkins", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14091660 updates, 321952 correspond to changes due to range change with average branching factor: 1.0656495378193023\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['RangeBool'].sum()} correspond to changes due to range change with average branching factor: 
{str_df[str_df['RangeBool'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 30, "id": "varied-reform", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "count 321952.000000\n", "mean 2.343707\n", "std 2.188651\n", "min 0.000000\n", "25% 1.000000\n", "50% 2.000000\n", "75% 3.000000\n", "max 47.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['RangeBool'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 38, "id": "annoying-transaction", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 14091660 updates, 229782 correspond to changes due to rearrangement with average branching factor: 3.5381753139932632\n" ] } ], "source": [ "print(f\"Out of {len(str_df)} updates, {str_df['Rearranged'].sum()} correspond to changes due to rearrangement with average branching factor: {str_df[str_df['Rearranged'] == True]['node2;branching'].mean()}\")" ] }, { "cell_type": "code", "execution_count": 32, "id": "three-characteristic", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 229782.000000\n", "mean 2.934938\n", "std 1.989685\n", "min 0.000000\n", "25% 1.000000\n", "50% 4.000000\n", "75% 4.000000\n", "max 56.000000\n", "Name: LevDist, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str_df[str_df['Rearranged'] == True].LevDist.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "military-coordinator", "metadata": {}, "outputs": [], "source": [ "str_df.LevDist.describe()" ] }, { "cell_type": "code", "execution_count": 14, "id": "european-treat", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'count v/s Lev edit distances')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAaE0lEQVR4nO3df5zcVX3v8debhEDJYvgRu8YkkNCmai7RSrb8KLTuVtSASB73lvYmN0Wo0PRRGx/eKtUg3IhYW9GLFgGLuV4uV4hZkSKkNBJbZMu9F6GQKoRAgysEkwgJEggupIXUz/3je9Z8M53dmZ18d2dzfD8fj3lkvt9z9nw/c3bmPd85MztRRGBmZge+g9pdgJmZVcOBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhm+0nSZZJuStePkTQgacJ+jLdZ0unp+sckfbmqWi1vDnRrWTl49mOMxZK+OtbHHS0R8cOI6IiIfwOQ1Cfpwv0Y788jouHP7+9xLA8OdGu3dwNr212EWQ4c6JmQNFPSrZKelfScpGvS/oMkXSrpKUk7JH1F0pTU1i1pa8045Zf7l0m6Of3MTyRtlNSV2m4EjgH+Ji0xfKROTY9JOqu0PTHVd8JgbcA7gDslHSrpplT7C5IekNQ5wjk4SNJyST9I49ws6ajU9k1Jy2r6PyTpPw0x1smS7k21PCSpu9Q2W9I/pDn5O2BqqW2WpEi39VPAbwDXpDm6ZohjnZt+P89JuqSmrbycU3eOhjqOpKskbZH0oqT1kn6jZty6v9vUXvf+lNrel363z0taJ+nYtF+SPp/uZy9K2iDp+GF/aVatiGjbBbge2AE80kTfzwPfS5fHgRfaWft4ugATgIfSHE0GDgVOS23vA/qB44AO4FbgxtTWDWytGWszcHq6fhnwL8CZ6Rh/AdxXr+8Qda0AVpW23w08Vto+GfhOuv6HwN8Ah6VjzQdeM8S4dY8LfBC4D5gBHAJ8CVid2t4L/L9S37nAC8AhdcaZDjyXbvfgk85zwGtT+3eAz6Vj/CbwE+Cm1DYLCGBi2u4DLhxmjuYCA2mcQ9K4e2p+Bzc1mqN6xwF+DzgamAh8GHgGOLTR75bh708LKe5Pb0rjXgrcm9reBawHjgCU+kxr9+Pj5+nS3oMXd+ITaCLQa37uA8D17Z688XIBTgGeHQyRmra7gPeXtt8AvJoejN00DvS/L7XNBXbX6ztEXb+cwu6wtL0KWFFq/yTw39L19wH3Am9u4vbWPS7wGPD20va00m09HHgJODa1fWqo+xDwUdKTXmnfOuA8ilcle4DJpbav0nqgrwB6S9uTgVeoH+hDzlGj46Q+zwNvafS7bXB/+iZwQWn7IOBl4FjgtyhOtk4GDmr34+Ln8dLWJZeIuAfYWd4n6Zck3ZleIv4fSW+s86OLgdVjUuSBYSbwVETsqdP2euCp0vZTFAHX7HLGM6XrLwOHSprYzA9GRD9FyL5H0mHA2RThN+hM9q6f30gRmr2SfiTpM5IObrLGQccC30jLES+kY/8b0BkRPwH+FliU+i6meIIZapzfGRwnjXUaxRPE64HnI+KlUv+n6ozRrNcDWwY30rjPDdF3RHMk6aK0NLIr3YYplJaHGPp3O9z96VjgqtK87KQ4G58eEd8GrgGuBXZIWinpNcPdeKvWeFxDXwl8ICLmAxcBXyw3pvW62cC321DbeLUFOGaIoP0RxYNw0OAZ5naKM9bDBhtUfNTutSM4bjPfvbyaIjwXAo+mkEfS6ygC8p8AIuLViPhERMwFfh04i2KZZCS2AGdExBGly6ERsa1ci6RTKJYR7h5mnBtrxpkcEZ8GngaOlDS51P+YYWpqNEdPUwQoAOmJ7+i6Aw0/R/scJ62XfwT4XeDIiDgC2EURvo0Md3/aAvxhzdz8QkTcm2r8QnrszgV+BfjTJo5nFRlXgS6pg+KO+nVJ36NYA51W020RcEukj4UZAP9IEQyfljQ5vXl2ampbDfxJeiO
vA/hz4Gvp7OtxirOyd6czvUsp1nGbtZ1ibX44vcA7gT9i37PzM4A7I4rX7ZJ6JM1LTyovUiyV/HSYcQ9Ot3PwMhG4DvhU6U2610paWPqZtRRPbpdTzMFQ499E8ariXZImpPG7Jc2IiKeAB4FPSJok6TTgPcPU2WiObgHOknSapEmptrqPywZzVHucwymeuJ8FJkpaATR7tjzc/ek64GJJ/yHVNEXS76TrvybppHRfeolijX6436FVbFwFOkU9L0TEr5Yub6rpswgvt+wjPbm9h2LN+ofAVuA/p+brKV6q3wM8SfEg+0D6uV3A+4EvA9soHoT7fOqlgb8ALk0vvy8aoranKd5E/HXga6Wm2o8rvo4i3F6kWCr5h1T3UNYCu0uXy4CrgDXAtyT9hOIN0pNKtfwrxZvCp7Pvk0ttzVsoXlF8jCIQt1CcaQ4+Xv5LGncn8HHgK8PUeRVwTvpEyBfqHGsj8Mepnqcp1rmH+h0MN0e1x1kH3EnxpP0Uxe99y78bsY7h7k8R8Q3gCoplnxeBRyienKF4wvgf6TY8RbF09NlmjmnVUDpBal8B0izgjog4Pm3fC3w+Ir4uSRRvAD2U2t5IcSedHe0u3FqWzqafAY6LiBfbXY9ZLtp6hi5pNcXZ2xskbZV0AbAEuEDSQ8BGijOlQYsoPhHgMD+wHUXx6RaHuVmF2n6GbmZm1Rhva+hmZtaipj5PPBqmTp0as2bNaulnX3rpJSZPnty44zhwoNTqOqvlOqvlOvdav379jyOi/seL2/UXTfPnz49W3X333S3/7Fg7UGp1ndVyndVynXsBD8Z4/EtRMzOrjgPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDQNd0vXpv5R6pEG/X5O0R9I51ZVnZmbNauYM/QZgwXAd0td5XgF8q4KazMysBQ0DPer8r0J1fAD4a4r/H9TMzNqgqS/nqv2K25q26RTf5dxD8d3bd0TELUOMsxRYCtDZ2Tm/t7e3paJ37NzF9t312+ZNn9LSmKNlYGCAjo6OdpfRkOusluusluvcq6enZ31EdNVrq+K7XP4S+GhE/LT4+vKhRcRKiv9ijq6uruju7m7pgFevup0rN9QvffOS1sYcLX19fbR6O8eS66yW66yW62xOFYHeRfG/l0DxH9CeKWlPRNxWwdhmZtak/Q70iJg9eF3SDRRLLrft77hmZjYyDQM9/a9C3cBUSVsp/g/FgwEi4rpRrc7MzJrWMNAjYnGzg0XE+ftVjZmZtcx/KWpmlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlomGgS7pekk7JD0yRPsSSQ9L2iDpXklvqb5MMzNrpJkz9BuABcO0Pwm8LSLmAZ8EVlZQl5mZjdDERh0i4h5Js4Zpv7e0eR8wo4K6zMxshBQRjTsVgX5HRBzfoN9FwBsj4sIh2pcCSwE6Ozvn9/b2jrhggB07d7F9d/22edOntDTmaBkYGKCjo6PdZTTkOqvlOqvlOvfq6elZHxFd9doanqE3S1IPcAFw2lB9ImIlaUmmq6sruru7WzrW1atu58oN9UvfvKS1MUdLX18frd7OseQ6q+U6q+U6m1NJoEt6M/Bl4IyIeK6KMc3MbGT2+2OLko4BbgXOjYjH978kMzNrRcMzdEmrgW5gqqStwMeBgwEi4jpgBXA08EVJAHuGWt8xM7PR08ynXBY3aL8QqPsmqJmZjR3/paiZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpaJhoEu6XpJOyQ9MkS7JH1BUr+khyWdUH2ZZmbWSDNn6Dc
AC4ZpPwOYky5Lgb/a/7LMzGykGgZ6RNwD7Bymy0LgK1G4DzhC0rSqCjQzs+YoIhp3kmYBd0TE8XXa7gA+HRH/N23fBXw0Ih6s03cpxVk8nZ2d83t7e1sqesfOXWzfXb9t3vQpLY05WgYGBujo6Gh3GQ25zmq5zmq5zr16enrWR0RXvbaJo3rkGhGxElgJ0NXVFd3d3S2Nc/Wq27lyQ/3SNy9pbczR0tfXR6u3cyy5zmq5zmq5zuZU8SmXbcDM0vaMtM/MzMZQFYG+Bnhv+rTLycCuiHi6gnHNzGwEGi65SFoNdANTJW0FPg4cDBAR1wFrgTOBfuBl4PdHq1gzMxtaw0CPiMUN2gP448oqMjOzlvgvRc3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMuFANzPLhAPdzCwTDnQzs0w40M3MMtFUoEtaIGmTpH5Jy+u0HyPpbknflfSwpDOrL9XMzIbTMNAlTQCuBc4A5gKLJc2t6XYpcHNEvBVYBHyx6kLNzGx4zZyhnwj0R8QTEfEK0AssrOkTwGvS9SnAj6or0czMmqGIGL6DdA6wICIuTNvnAidFxLJSn2nAt4AjgcnA6RGxvs5YS4GlAJ2dnfN7e3tbKnrHzl1s312/bd70KS2NOVoGBgbo6OhodxkNuc5quc5quc69enp61kdEV722iRUdYzFwQ0RcKekU4EZJx0fET8udImIlsBKgq6sruru7WzrY1atu58oN9UvfvKS1MUdLX18frd7OseQ6q+U6q+U6m9PMkss2YGZpe0baV3YBcDNARHwHOBSYWkWBZmbWnGYC/QFgjqTZkiZRvOm5pqbPD4G3A0h6E0WgP1tloWZmNryGgR4Re4BlwDrgMYpPs2yUdLmks1O3DwN/IOkhYDVwfjRanDczs0o1tYYeEWuBtTX7VpSuPwqcWm1pZmY2Ev5LUTOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBNNBbqkBZI2SeqXtHyIPr8r6VFJGyV9tdoyzcyskYmNOkiaAFwLvAPYCjwgaU1EPFrqMwe4GDg1Ip6X9IujVbCZmdXXzBn6iUB/RDwREa8AvcDCmj5/AFwbEc8DRMSOass0M7NGFBHDd5DOARZExIVp+1zgpIhYVupzG/A4cCowAbgsIu6sM9ZSYClAZ2fn/N7e3paK3rFzF9t312+bN31KS2OOloGBATo6OtpdRkOus1qus1quc6+enp71EdFVr63hkkuTJgJzgG5gBnCPpHkR8UK5U0SsBFYCdHV1RXd3d0sHu3rV7Vy5oX7pm5e0NuZo6evro9XbOZZcZ7VcZ7VcZ3OaWXLZBswsbc9I+8q2Amsi4tWIeJLibH1ONSWamVkzmgn0B4A5kmZLmgQsAtbU9LmN4uwcSVOBXwGeqK5MMzNrpGGgR8QeYBmwDngMuDkiNkq6XNLZqds64DlJjwJ3A38aEc+NVtFmZvbvNbWGHhFrgbU1+1aUrgfwoXQxM7M28F+KmpllwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5llwoFuZpYJB7qZWSYc6GZmmXCgm5lloqlAl7RA0iZJ/ZKWD9PvtyWFpK7qSjQzs2Y0DHRJE4BrgTOAucBiSXPr9Dsc+CBwf9VFmplZY82coZ8I9EfEExHxCtALLKzT75PAFcC/VFifmZk1SRExfAfpHGBBRFyYts8FToqIZaU+JwCXRMRvS+oDLoqIB+uMtRRYCtDZ2Tm/t7e3paJ
37NzF9t312+ZNn9LSmKNlYGCAjo6OdpfRkOusluusluvcq6enZ31E1F3Wnri/g0s6CPgccH6jvhGxElgJ0NXVFd3d3S0d8+pVt3Plhvqlb17S2pijpa+vj1Zv51hyndVyndVync1pZsllGzCztD0j7Rt0OHA80CdpM3AysMZvjJqZja1mAv0BYI6k2ZImAYuANYONEbErIqZGxKyImAXcB5xdb8nFzMxGT8NAj4g9wDJgHfAYcHNEbJR0uaSzR7tAMzNrTlNr6BGxFlhbs2/FEH27978sMzMbKf+lqJlZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlgkHuplZJhzoZmaZcKCbmWXCgW5mlommAl3SAkmbJPVLWl6n/UOSHpX0sKS7JB1bfalmZjachoEuaQJwLXAGMBdYLGluTbfvAl0R8WbgFuAzVRdqZmbDa+YM/USgPyKeiIhXgF5gYblDRNwdES+nzfuAGdWWaWZmjSgihu8gnQMsiIgL0/a5wEkRsWyI/tcAz0TEn9VpWwosBejs7Jzf29vbUtE7du5i++76bfOmT2lpzNEyMDBAR0dHu8toyHVWy3VWy3Xu1dPTsz4iuuq1TazyQJJ+D+gC3lavPSJWAisBurq6oru7u6XjXL3qdq7cUL/0zUtaG3O09PX10ertHEuus1qus1qusznNBPo2YGZpe0batw9JpwOXAG+LiH+tpjwzM2tWM2voDwBzJM2WNAlYBKwpd5D0VuBLwNkRsaP6Ms3MrJGGgR4Re4BlwDrgMeDmiNgo6XJJZ6dunwU6gK9L+p6kNUMMZ2Zmo6SpNfSIWAusrdm3onT99IrrMjOzEfJfipqZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZcKBbmaWCQe6mVkmHOhmZplwoJuZZaKpQJe0QNImSf2SltdpP0TS11L7/ZJmVV6pmZkNq2GgS5oAXAucAcwFFkuaW9PtAuD5iPhl4PPAFVUXamZmw5vYRJ8Tgf6IeAJAUi+wEHi01GchcFm6fgtwjSRFRFRYa1NmLf/buvs3f/rdY1yJmdnYaibQpwNbSttbgZOG6hMReyTtAo4GflzuJGkpsDRtDkja1ErRwNTasRtR+14zjLjWNnGd1XKd1XKdex07VEMzgV6ZiFgJrNzfcSQ9GBFdFZQ06g6UWl1ntVxntVxnc5p5U3QbMLO0PSPtq9tH0kRgCvBcFQWamVlzmgn0B4A5kmZLmgQsAtbU9FkDnJeunwN8ux3r52ZmP88aLrmkNfFlwDpgAnB9RGyUdDnwYESsAf4ncKOkfmAnReiPpv1ethlDB0qtrrNarrNarrMJ8om0mVke/JeiZmaZcKCbmWXigAv0Rl9DMMa1zJR0t6RHJW2U9MG0/yhJfyfp++nfI9N+SfpCqv1hSSeMcb0TJH1X0h1pe3b6qob+9NUNk9L+tn2Vg6QjJN0i6Z8lPSbplPE4n5L+JP3OH5G0WtKh42U+JV0vaYekR0r7RjyHks5L/b8v6bx6xxqFOj+bfvcPS/qGpCNKbRenOjdJeldp/6hmQr06S20flhSSpqbtts0nABFxwFwo3pT9AXAcMAl4CJjbxnqmASek64cDj1N8PcJngOVp/3LginT9TOCbgICTgfvHuN4PAV8F7kjbNwOL0vXrgD9K198PXJeuLwK+NoY1/m/gwnR9EnDEeJtPij+kexL4hdI8nj9e5hP4TeAE4JHSvhHNIXAU8ET698h0/cgxqPOdwMR0/YpSnXPT4/0QYHbKgQljkQn16kz7Z1J8WOQpYGq75zMiDrhAPwVYV9q+GLi43XWV6rkdeAewCZiW9k0DNqXrXwIWl/r/rN8Y1DY
DuAv4LeCOdIf7cenB87O5TXfSU9L1iamfxqDGKSkoVbN/XM0ne/8y+qg0P3cA7xpP8wnMqgnKEc0hsBj4Umn/Pv1Gq86atv8IrErX93msD87pWGVCvTopvubkLcBm9gZ6W+fzQFtyqfc1BNPbVMs+0svotwL3A50R8XRqegboTNfbWf9fAh8Bfpq2jwZeiIg9dWrZ56scgMGvchhts4Fngf+Vloa+LGky42w+I2Ib8N+BHwJPU8zPesbffJaNdA7Hw2PtfRRnuwxTT1vqlLQQ2BYRD9U0tbXOAy3QxyVJHcBfA/81Il4st0XxdNzWz4ZKOgvYERHr21lHEyZSvLT9q4h4K/ASxfLAz4yT+TyS4gvpZgOvByYDC9pZ00iMhzlsRNIlwB5gVbtrqSXpMOBjwIp211LrQAv0Zr6GYExJOpgizFdFxK1p93ZJ01L7NGBH2t+u+k8Fzpa0GeilWHa5CjhCxVc11NbSrq9y2ApsjYj70/YtFAE/3ubzdODJiHg2Il4FbqWY4/E2n2UjncO2PdYknQ+cBSxJTz4MU0876vwliifzh9JjagbwT5Je1+46D7RAb+ZrCMaMJFH8lexjEfG5UlP5qxDOo1hbH9z/3vRO+MnArtLL4FETERdHxIyImEUxZ9+OiCXA3RRf1VCvzjH/KoeIeAbYIukNadfbKb6meVzNJ8VSy8mSDkv3gcE6x9V81hjpHK4D3inpyPSK5J1p36iStIBiafDsiHi5pv5F6RNDs4E5wD/ShkyIiA0R8YsRMSs9prZSfDjiGdo9n1Uvyo/2heJd5Mcp3tm+pM21nEbx0vVh4HvpcibF+uhdwPeBvweOSv1F8Z+F/ADYAHS1oeZu9n7K5TiKB0U/8HXgkLT/0LTdn9qPG8P6fhV4MM3pbRSfCBh38wl8Avhn4BHgRopPX4yL+QRWU6ztv0oRNhe0MocUa9j96fL7Y1RnP8Va8+Dj6bpS/0tSnZuAM0r7RzUT6tVZ076ZvW+Ktm0+I8J/+m9mlosDbcnFzMyG4EA3M8uEA93MLBMOdDOzTDjQzcwy4UA3M8uEA93MLBP/HxyZ/1kdA/yIAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "str_df.LevDist.hist(bins=50).set_title(\"count v/s Lev edit distances\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "quarterly-shock", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'count v/s Lev edit distances till 20')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEICAYAAABPgw/pAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAc/klEQVR4nO3dfZRcVZnv8e/PhJeZNAYwTosJEND4EkEZ0gIKavcSMQRN5s4wXiIiDGCGq3HpnQGNg4PIOAq6wKWIMpFhRSGkeRleciEauDP0cB2MQ6JACBkwIEhiSJRAxwYUgs/94+zOFJV663rPye+zVq0+p/beZz+9z6mnTu2qU6WIwMzMdn6v6HQAZmbWHE7oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07otsuSdL6kq9PyAZJGJI1rYHuPSTo2Lf+dpCuaFWs3S+N2cFpeJOlLablf0vrORrdrcULfBRUmnga2MVfSNe3ut1Ui4pcR0RMRLwFIGpJ0ZgPb+3JEVG3faD/tVireNG6PjnE7fyJpiaRfSRqW9B+Sjiyq82FJj0t6VtLNkvZtxv+QZ07oVq8TgGWdDsJ2Wj3APcAMYF/ge8BtknoAJL0F+CfgFKAXeA74dmdC3YlEhG8dvAH7AzcCvwaeAr6V7n8F8HngcWAz8H1gYirrB9YXbecx4Ni0fD5wXWrzW2AN0JfKrgL+ADwPjACfKRHTWuADBevjU3yHF8S2CZgE7AlcnWJ/huxB2lvmf90eY9H9rwAWAI+k7VwH7JvKfgDML6p/H/DnZfo4Crg7xXIf0F9QdhDw72lM7gC+BVydyqYCkf7XfwReAn6XxuhbZfo6Je2fp4BzS+yD0W2XHKNy/QDfAJ4AtgKrgHcV9Fl231Y6nlLZ6WnfPg0sBw5M9wv4OtlxthVYDRxS4v8tF28Ar0/Li4AvlTtOqzwWtgIz0vKXgWsKyl4HvADs1enHbDffOts5XJkOogdqrP8h4MF0EF/Tytja9P+PS0nn68CE9MA/JpWdDqwDDiY7m7kRuCqV7fBAKZFMfgfMSn18BVhRqm6ZuM4DFhesnwCsLVg/CvhxWv5r4P8Af5z6mgG8ssx2S/YLfApYAUwB9iA7M1uSyj4K/EdB3elkSXGPEtuZnJLYLLInifel9Ven8h8Dl6Q+3k2WEHdI6Gl9CDizwhhNT0nt3Wl7lwDbKJ3Qy45RqX6AjwCvInty+VvgSWDPavuWysfTHLLj6c1pu58H7k5l7yd74tibLLm/GdivzP9dKt6GEzpwWPq/Jqb1W4DPFtUZISV830rfOj3lsgiYWUtFSdOAzwFHR8RbgE+3Lqy2OQJ4LXBORDwbEb+LiB+lspOBSyLi0YgYIfvfT5I0vsZt/ygilkU2J3wV8LYxxHUNMFvSH6f1DwNLCsoLp1teJEs+r4+IlyJiVURsHUNfAGcB50bE+oj4PVnSOjH9rzcBh0k6MNU9Gbgx1Sv2EWBZ+r//EBF3ACuBWZIOAN4O/H1E/D4i7iJLsvU6Ebg1Iu5
Ksfw92SufUsY0RhFxdUQ8FRHbIuJisieMNxZUKbdvKx1PZwFfiYi1EbGN7Ax4dFxfBPYC3gQo1dk49iGpj6RXpv/jixExnO7uAYaLqg6nOK2Mjib09KDaUnifpNdJ+qGkVZL+n6Q3paKPAZdFxNOp7eY2h9sK+wOPpwdYsdeSvZwf9TjZmVVvjdt+smD5OWDPWp8MImId2UvzD6akPpssyY+axX8n9KvIXr4Ppje4vipptxpjHHUgcJOkZyQ9k/p+iWzq5rfAbcBJqe5cYHGF7fzl6HbSto4B9iMbz6cj4tmC+o+X2EatXks2LQJA2u5TZeqOaYwknS1pbXqz8BlgItn01qhy+7bS8XQg8I2CcdlCdjY+OSL+jWz66TJgs6SFKcm2nKQ/IntiXRERXykoGgGKY3gl2asqK6PTZ+ilLAQ+GREzgLP57zdC3gC8Ib0bvkJSTWf2Xe4J4IAyifZXZA/CUQeQvaTfBDxL9vIdgPRRu1ePod9avjN5CVnynAM8mJI8kl5DliB/ChARL0bEFyNiOvBO4ANk0yRj8QRwfETsXXDbMyI2FMYi6R1k0wh3VtjOVUXbmRARFwIbgX0kTSiof0CFmKqN0UayBApAeuJ7VckNVR6jl/Uj6V3AZ8imF/eJiL3JzkxVJR6ofDw9Afx10dj8UUTcnWL8ZnrMTSd7rJ1Tpo+mfd+2pD2Am4H1ZNNShdZQ8KoyfSxyD+DhZvWfR12V0NM73O8Erpd0L9lc6n6peDwwjWxebi7wXUl7tz/KpvpPssRwoaQJkvaUdHQqWwL8b0kHpXH5MnBtOvt6mOys7IR0pvd5soO9VpvI5uYrGQSOA/4XLz87Px74YUQ2qSlpQNKh6UllK9nL93JTDwC7pf9z9DYeuBz4x9FpFUmvljSnoM0ysie3C8jGoNz2ryZ7VfF+SePS9vslTYmIx8mmX74oaXdJxwAfrBBntTG6AfiApGMk7Z5iK/l4qjJGxf3sRfbE/WtgvKTz2PFMtZxKx9PlwOfSp0eQNFHSX6blt0s6Mh1Lz5LNZZcb41qOnapSXzeQvTl/aol9uphsX74rPQlfQDbV5jP0CroqoZPF80xEHFZwe3MqWw8sTWc7vyBLatM6FmkTpDnQDwKvB35J9j/+z1R8JdlL9buAX5A9yD6Z2g0DHweuADaQPQjHcgHHV4DPp5ffZ5eJbSPZm4jvBK4tKCr+uOJryB6YW8mmSv49xV3OMrIH8ejtfLJPdSwFbpf0W7I3SLd/JjnNUd8IHMvLn1yKY36C7BXF35ElxCfIzjRHj/MPp+1uAb5A9kmRcr5BNo//tKRvluhrDfCJFM9Gsk+OlNsHlcaouJ/lwA/Jju/Hyfb7EztssYRKx1NE3ARcRDbtsxV4gOzJGbInjO+m/2H0UztfK9NNxXEZg9FXKscBzyi7OGkkvUIZHd+zyBL7ZrInuo830N8uQelEq3MBSFPJ3lw6JK3fDXw9Iq6XJOCtEXFfmmKZGxGnSpoE/Aw4LCLKzVtak6Wz6SeBg+t449PMWqyjZ+iSlpCdBb5R0npJZ5B9iuEMSfeRzaONvvReDjwl6UGyOdRznMzbbl+yT4k4mZt1oY6foZuZWXN02xy6mZnVqdaLVJpu0qRJMXXq1LraPvvss0yYMKF6xTbr1rige2NzXGPjuMYmj3GtWrXqNxFR+mPKzbjctJ7bjBkzol533nln3W1bqVvjiuje2BzX2DiuscljXMDK6NJL/83MrEmc0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsIJ3cwsJzp26b+Z7Wjqgtu2Lz924QkdjMR2Rk7oZjnhJwPzlIuZWU44oZuZ5UTVhC7pSkmbJT1Qpd7bJW2TdGLzwjMzs1rVcoa+CJhZqUL6NfOLgNubEJOZmdWhakKPiLvIfiW9kk8C/0L269xmZtYBNf2mqKSpwK0RcUiJssnANcAAcGWqd0OZ7cwD5gH09vbOGBwcrCvokZE
Renp66mrbSt0aF3RvbI7r5VZvGN6+fOjkiTuUV4qrWttW8n4cm0biGhgYWBURfSULy/3yReENmAo8UKbseuCotLwIOLGWbfoXi9qrW2NzXC934Gdv3X4rpVJc1dq2kvfj2LTqF4ua8Tn0PmBQEsAkYJakbRFxcxO2bWZmNWo4oUfEQaPLkhaRTbnc3Oh2zcxsbKomdElLgH5gkqT1wBeA3QAi4vKWRmdmZjWrmtAjYm6tG4uI0xqKxszM6uYrRc3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsI/QWfWZP4pOOsUn6GbmeWEE7qZWU44oZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU74wiIze9nFUOALonZWPkM3M8sJJ3Qzs5xwQjczy4mqCV3SlZI2S3qgTPnJku6XtFrS3ZLe1vwwzcysmlrO0BcBMyuU/wJ4T0QcCvwDsLAJcZmZ2RhV/ZRLRNwlaWqF8rsLVlcAU5oQl5mZjZEionqlLKHfGhGHVKl3NvCmiDizTPk8YB5Ab2/vjMHBwTEHDDAyMkJPT09dbVupW+OC7o0tj3Gt3jC8ffnQyROb2rZSXM3qt572edyPrdRIXAMDA6sioq9UWdMSuqQB4NvAMRHxVLVt9vX1xcqVK6v2XcrQ0BD9/f11tW2lbo0Luje2PMbVyA9cVGtbKa5m9VtP+zzux1ZqJC5JZRN6Uy4skvRW4Arg+FqSuZmZNV/DH1uUdABwI3BKRDzceEhmnbd6wzBTF9y2w5mrWTereoYuaQnQD0yStB74ArAbQERcDpwHvAr4tiSAbeVeDpiZWevU8imXuVXKzwRKvglqZmbt4ytFzcxywgndzCwnnNDNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHLCCd3MLCec0M3McqIpX59r1m0a+W5ws52Vz9DNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnnNDNzHKiakKXdKWkzZIeKFMuSd+UtE7S/ZIOb36YZmZWTS1n6IuAmRXKjwempds84DuNh2VmZmNVNaFHxF3AlgpV5gDfj8wKYG9J+zUrQDMzq40ionolaSpwa0QcUqLsVuDCiPhRWv9X4LMRsbJE3XlkZ/H09vbOGBwcrCvokZERenp66mrbSt0aF3RvbK2Ka/WG4e3Lh06eOOb2m7cMs+n5+to30ne1tpXGq1n91tN+Vzu+GtVIXAMDA6sioq9UWVu/yyUiFgILAfr6+qK/v7+u7QwNDVFv21bq1rige2NrVVynFX6Xy8lj3/6li2/h4tXj62rfSN/V2lYar2b1W0/7Xe34alSr4mrGp1w2APsXrE9J95mZWRs1I6EvBT6aPu1yFDAcERubsF0zMxuDqlMukpYA/cAkSeuBLwC7AUTE5cAyYBawDngO+KtWBWtmZuVVTegRMbdKeQCfaFpEZmZWF18pamYNW71hmKkLbnvZD4tY+zmhm5nlhBO6mVlOOKGbmeWEfyTaupZ/6NlsbHyGbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTNSV0STMlPSRpnaQFJcoPkHSnpJ9Jul/SrOaHamZmlVT9gQtJ44DLgPcB64F7JC2NiAcLqn0euC4iviNpOrAMmNqCeG0n4x+pMGufWs7QjwDWRcSjEfECMAjMKaoTwCvT8kTgV80L0czMaqGIqFxBOhGYGRFnpvVTgCMjYn5Bnf2A24F9gAnAsRGxqsS25gHzAHp7e2cMDg7WFfTIyAg9PT11tW2lbo0LOhfb6g3D25cPnTxxh/JKcVVr20i/1WzeMsym59vfdzeMVz3tGxmvVurWx2QjcQ0MDKyKiL5SZc36TdG5wKKIuFjSO4CrJB0SEX8orBQRC4GFAH19fdHf319XZ0NDQ9T
btpW6NS7oXGynFU65nLxj/5Xiqta2kX6ruXTxLVy8enzb++6G8aqnfSPj1Urd+phsVVy1TLlsAPYvWJ+S7it0BnAdQET8GNgTmNSMAM3MrDa1JPR7gGmSDpK0O3ASsLSozi+B9wJIejNZQv91MwM1M7PKqib0iNgGzAeWA2vJPs2yRtIFkmanan8LfEzSfcAS4LSoNjlvZmZNVdMcekQsI/soYuF95xUsPwgc3dzQzMxsLHylqJlZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU406/vQzczq4p8pbB6foZuZ5YQTuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5URNCV3STEkPSVonaUGZOh+S9KCkNZKuaW6YZmZWTdULiySNAy4D3gesB+6RtDT9MPRonWnA54CjI+JpSX/SqoDNzKy0Ws7QjwDWRcSjEfECMAjMKarzMeCyiHgaICI2NzdMMzOrRhFRuYJ0IjAzIs5M66cAR0bE/II6NwMPA0cD44DzI+KHJbY1D5gH0NvbO2NwcLCuoEdGRujp6amrbSt1a1zQudhWbxjevnzo5Ik7lFeKq1rbRvqtZvOWYTY93/6+u2G86mnfqfGqplsfk43ENTAwsCoi+kqVNeu7XMYD04B+YApwl6RDI+KZwkoRsRBYCNDX1xf9/f11dTY0NES9bVupW+OCzsV2WuH3dJy8Y/+V4qrWtpF+q7l08S1cvHp82/vuhvGqp32nxquabn1MtiquWqZcNgD7F6xPSfcVWg8sjYgXI+IXZGfr05oTopmZ1aKWhH4PME3SQZJ2B04ClhbVuZns7BxJk4A3AI82L0wzM6umakKPiG3AfGA5sBa4LiLWSLpA0uxUbTnwlKQHgTuBcyLiqVYFbWZmO6ppDj0ilgHLiu47r2A5gL9JNzMz6wBfKWpmlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhPN+sUiM7O2m1r4a0cXntDBSLqDz9DNzHLCCd3MLCec0M3McsIJ3cwsJ5zQzcxywgndzCwnakrokmZKekjSOkkLKtT7C0khqa95IZqZWS2qJnRJ44DLgOOB6cBcSdNL1NsL+BTwk2YHaWZm1dVyYdERwLqIeBRA0iAwB3iwqN4/ABcB5zQ1Qus4X7xhtnNQRFSuIJ0IzIyIM9P6KcCRETG/oM7hwLkR8ReShoCzI2JliW3NA+YB9Pb2zhgcHKwr6JGREXp6eupq20rdGhc0FtvqDcPblw+dPLGpbSvF1cp+q9m8ZZhNz7e/724Yr3ra74zj1UmNxDUwMLAqIkpOazd86b+kVwCXAKdVqxsRC4GFAH19fdHf319Xn0NDQ9TbtpW6NS5oLLbTCs/QTx7bNqq1rRRXK/ut5tLFt3Dx6vFt77sbxque9jvjeHVSq+Kq5U3RDcD+BetT0n2j9gIOAYYkPQYcBSz1G6NmZu1VS0K/B5gm6SBJuwMnAUtHCyNiOCImRcTUiJgKrABml5pyMTOz1qma0CNiGzAfWA6sBa6LiDWSLpA0u9UBmplZbWqaQ4+IZcCyovvOK1O3v/GwzMxsrHylqJlZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU74R6LNbJeUx+8o8hm6mVlOOKGbmeWEE7qZWU44oZuZ5YQTuplZTuySn3LJ47vbZmY+QzczywkndDOznNglp1x2Rp4mMrNqfIZuZpYTTuhmZjnhhG5mlhNO6GZmOVHTm6KSZgLfAMYBV0TEhUXlfwOcCWwDfg2cHhGPNzlWM7Ou0Y0fVKh6hi5pHHAZcDwwHZgraXpRtZ8BfRHxVuAG4KvNDtTMzCqrZcrlCGBdRDwaES8Ag8CcwgoRcWdEPJdWVwBTmhummZlVo4ioXEE6EZgZEWem9VOAIyNifpn63wKejIg
vlSibB8wD6O3tnTE4OFhX0CMjI/T09NTVFmD1huHty4dOnlj3doo1GlcljcbcSGyN9F2tbaW4WtlvNZu3DLPp+fb33Q3jVU/7XW28Gm3fyONxYGBgVUT0lSpr6oVFkj4C9AHvKVUeEQuBhQB9fX3R399fVz9DQ0PU2xbgtMK5r5Pr306xRuOqpNGYL118Cxf/6Nms/Rjn+xrpu1rbSmPWyn6ruXTxLVy8enzb++6G8aqn/a42Xo22b1WuqCWhbwD2L1ifku57GUnHAucC74mI3zcnPDMzq1Utc+j3ANMkHSRpd+AkYGlhBUl/CvwTMDsiNjc/TDMzq6ZqQo+IbcB8YDmwFrguItZIukDS7FTta0APcL2keyUtLbM5MzNrkZrm0CNiGbCs6L7zCpaPbXJcZmY2Rr5S1MwsJ/z1uWZmbVB4ZemimRNa0ocT+hh14+W+ZmbgKRczs9xwQjczywkndDOznHBCNzPLCb8p2kZ+Q9XMWsln6GZmOeGEbmaWE07oZmY54YRuZpYTTuhmZjnhhG5mlhNO6GZmObFTJvTVG4aZuuC2l32u28xsV7dTJnQzM9uRE7qZWU44oZuZ5YQTuplZTtSU0CXNlPSQpHWSFpQo30PStan8J5KmNj1SMzOrqGpClzQOuAw4HpgOzJU0vajaGcDTEfF64OvARc0O1MzMKqvlDP0IYF1EPBoRLwCDwJyiOnOA76XlG4D3SlLzwjQzs2oUEZUrSCcCMyPizLR+CnBkRMwvqPNAqrM+rT+S6vymaFvzgHlp9Y3AQ3XGPQn4TdVa7detcUH3xua4xsZxjU0e4zowIl5dqqCtP3AREQuBhY1uR9LKiOhrQkhN1a1xQffG5rjGxnGNza4WVy1TLhuA/QvWp6T7StaRNB6YCDzVjADNzKw2tST0e4Bpkg6StDtwErC0qM5S4NS0fCLwb1FtLsfMzJqq6pRLRGyTNB9YDowDroyINZIuAFZGxFLgn4GrJK0DtpAl/VZqeNqmRbo1Luje2BzX2Diusdml4qr6pqiZme0cfKWomVlOOKGbmeVEVyf0bvzKAUn7S7pT0oOS1kj6VIk6/ZKGJd2bbue1Oq7U72OSVqc+V5Yol6RvpvG6X9LhbYjpjQXjcK+krZI+XVSnbeMl6UpJm9O1E6P37SvpDkk/T3/3KdP21FTn55JOLVWnyXF9TdJ/pX11k6S9y7StuN9bENf5kjYU7K9ZZdpWfPy2IK5rC2J6TNK9Zdq2ZLzK5Ya2Hl8R0ZU3sjdgHwEOBnYH7gOmF9X5OHB5Wj4JuLYNce0HHJ6W9wIeLhFXP3BrB8bsMWBShfJZwA8AAUcBP+nAPn2S7MKIjowX8G7gcOCBgvu+CixIywuAi0q02xd4NP3dJy3v0+K4jgPGp+WLSsVVy35vQVznA2fXsK8rPn6bHVdR+cXAee0cr3K5oZ3HVzefoXflVw5ExMaI+Gla/i2wFpjcyj6baA7w/cisAPaWtF8b+38v8EhEPN7GPl8mIu4i+yRWocLj6HvAn5Vo+n7gjojYEhFPA3cAM1sZV0TcHhHb0uoKsmtA2qrMeNWilsdvS+JKOeBDwJJm9VdjTOVyQ9uOr25O6JOBJwrW17Nj4txeJx34w8Cr2hIdkKZ4/hT4SYnid0i6T9IPJL2lTSEFcLukVcq+ZqFYLWPaSidR/kHWifEa1RsRG9Pyk0BviTqdHrvTyV5dlVJtv7fC/DQVdGWZKYROjte7gE0R8fMy5S0fr6Lc0Lbjq5sTeleT1AP8C/DpiNhaVPxTsmmFtwGXAje3KaxjIuJwsm/G/ISkd7ep36qUXZQ2G7i+RHGnxmsHkb3+7arP8ko6F9gGLC5Tpd37/TvA64DDgI1k0xvdZC6Vz85bOl6VckOrj69uTuhd+5UDknYj22GLI+LG4vKI2BoRI2l5GbCbpEmtjisiNqS/m4GbyF72FqplTFvleOCnEbGpuKBT41Vg0+jUU/q7uUSdjoydpNOADwAnp2Swgxr
2e1NFxKaIeCki/gB8t0x/nRqv8cCfA9eWq9PK8SqTG9p2fHVzQu/KrxxI83P/DKyNiEvK1HnN6Fy+pCPIxrmlTzSSJkjaa3SZ7A21B4qqLQU+qsxRwHDBS8FWK3vW1InxKlJ4HJ0K3FKiznLgOEn7pCmG49J9LSNpJvAZYHZEPFemTi37vdlxFb7v8j/K9FfL47cVjgX+K9I3vxZr5XhVyA3tO76a/U5vk981nkX2TvEjwLnpvgvIDnCAPclewq8D/hM4uA0xHUP2kul+4N50mwWcBZyV6swH1pC9s78CeGcb4jo49Xdf6nt0vArjEtmPlTwCrAb62rQfJ5Al6IkF93VkvMieVDYCL5LNU55B9r7LvwI/B/4vsG+q2wdcUdD29HSsrQP+qg1xrSObVx09zkY/0fVaYFml/d7iuK5Kx8/9ZMlqv+K40voOj99WxpXuXzR6XBXUbct4VcgNbTu+fOm/mVlOdPOUi5mZjYETuplZTjihm5nlhBO6mVlOOKGbmeWEE7qZWU44oZuZ5cT/B3ZNeWmC500qAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "str_df.LevDist[str_df.LevDist <= 20].hist(bins=100).set_title(\"count v/s Lev edit distances till 20\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "entire-candle", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "list index out of range\r\n" ] } ], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_string.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_str_new_vals.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 label \\\n", " --filter-keys node1 label \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv" ] }, { "cell_type": "code", "execution_count": 35, "id": "similar-nevada", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "16922584 ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_str_not_updated.tsv" ] }, { "cell_type": "markdown", "id": "administrative-barbados", "metadata": {}, "source": [ "### Dates Comparison" ] }, { "cell_type": "code", "execution_count": 63, "id": "creative-office", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-03-15 01:44:30 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_22_c1.\"node1\", graph_22_c1.\"label\", graph_22_c1.\"node2\", graph_24_c2.\"label\" \"_aLias.newNode2Label\", graph_24_c2.\"node2\" \"_aLias.newNode2\"\n", " FROM graph_22 AS graph_22_c1, graph_24 AS graph_24_c2\n", " WHERE graph_22_c1.\"node1\"=graph_24_c2.\"node1\"\n", " AND (graph_22_c1.\"label\" = graph_24_c2.\"label\")\n", " PARAS: []\n", "---------------------------------------------\n", "[2021-03-15 01:44:30 sqlstore]: CREATE INDEX on table graph_22 column node1 
...\n", "[2021-03-15 01:44:33 sqlstore]: ANALYZE INDEX on table graph_22 column node1 ...\n", "[2021-03-15 01:44:34 sqlstore]: CREATE INDEX on table graph_24 column node1 ...\n", "[2021-03-15 01:45:08 sqlstore]: ANALYZE INDEX on table graph_24 column node1 ...\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_date.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/claims.time.tsv.gz \\\n", " --match \"node2: (x)-[r]->(y), time: (x)-[s]->(z)\" \\\n", " --where \"r.label = s.label\" \\\n", " --return 'x, r.label, y, s.label as newNode2Label, z as newNode2' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_date_new_vals_rightone.tsv\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "sophisticated-glance", "metadata": {}, "outputs": [], "source": [ "# from dateutil.parser import parse\n", "# import re\n", "# import rltk\n", "# from rltk.similarity import levenshtein_distance as ld\n", "# from nltk.tokenize import word_tokenize as wt\n", "# from tqdm.notebook import tqdm\n", "\n", "# f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",\"r\").read().split('\\n')\n", "# fStr = open(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date_measured.tsv\",\"w\")\n", "\n", "# firstLine = f1[0]\n", "\n", "# fStr.write(firstLine+\"\\tSameDate\\n\")\n", "\n", "# for i in tqdm(range(1, len(f1)-1)):\n", "# line = f1[i]\n", "# val1 = line.split(\"\\t\")[2]\n", "# val2 = line.split(\"\\t\")[4]\n", "# val2 = val2[1:-1]\n", "# versionBool = bool(re.fullmatch(\"[\\d\\.]+[\\w\\s\\d]*\",val1))\n", "# rangeBool = bool(re.fullmatch(\"[\\d]+[-|–][\\d]+\",val1))\n", "# LevDist = ld(val1,val2)\n", "# rearranged = set(wt(val1)) == set(wt(val2))\n", "# rearrangedFirstNP = set(wt(val1)) == set(wt(val2[1:]))\n", "# fStr.write(line+ \"\\t\" + str(versionBool) + \"\\t\" + str(rangeBool) + \"\\t\" + \\\n", "# str(LevDist) + \"\\t\" + str(rearranged) + \"\\t\" + 
str(rearrangedFirstNP) + \"\\n\")\n", "\n", "# fStr.close()" ] }, { "cell_type": "code", "execution_count": 1, "id": "identified-calculation", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "date_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_date_new_vals_rightone.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 2, "id": "federal-cursor", "metadata": {}, "outputs": [], "source": [ "# date_df1 = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_new_vals_date.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 3, "id": "infinite-handbook", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>Q1004723</td><td>P569</td><td>^00000001887-00-00T00:00:00Z/9</td><td>P569</td><td>^1887-01-01T00:00:00Z/9</td></tr>\n", " <tr><th>1</th><td>Q102084</td><td>P569</td><td>^00000001093-00-00T00:00:00Z/9</td><td>P569</td><td>^1093-01-01T00:00:00Z/9</td></tr>\n", " <tr><th>2</th><td>Q10272460</td><td>P570</td><td>^00000001917-00-00T00:00:00Z/9</td><td>P570</td><td>^1919-03-06T00:00:00Z/11</td></tr>\n", " <tr><th>3</th><td>Q10289892</td><td>P569</td><td>^00000001953-00-00T00:00:00Z/9</td><td>P569</td><td>^1953-01-01T00:00:00Z/9</td></tr>\n", " <tr><th>4</th><td>Q1029352</td><td>P569</td><td>^00000001893-00-00T00:00:00Z/9</td><td>P569</td><td>^1893-01-20T00:00:00Z/11</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " node1 label node2 newNode2Label \\\n", "0 Q1004723 P569 ^00000001887-00-00T00:00:00Z/9 P569 \n", "1 Q102084 P569 ^00000001093-00-00T00:00:00Z/9 P569 \n", "2 Q10272460 P570 ^00000001917-00-00T00:00:00Z/9 P570 \n", "3 Q10289892 P569 ^00000001953-00-00T00:00:00Z/9 P569 \n", "4 Q1029352 P569 ^00000001893-00-00T00:00:00Z/9 P569 \n", "\n", " newNode2 \n", "0 ^1887-01-01T00:00:00Z/9 \n", "1 ^1093-01-01T00:00:00Z/9 \n", "2 ^1919-03-06T00:00:00Z/11 \n", "3 ^1953-01-01T00:00:00Z/9 \n", "4 ^1893-01-20T00:00:00Z/11 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "established-joining", "metadata": {}, "outputs": [], "source": [ "def parseDate(str):\n", "# try:\n", " if str == '' or str == \" \": return []\n", " elems = []\n", " toFetchI = 1\n", " dash1 = str.find(\"-\",toFetchI)\n", " toFetchI = dash1 + 1\n", " elems.append(int(str[:dash1]))\n", "\n", " dash2 = str.find(\"-\",toFetchI)\n", " toFetchI = dash2 + 1\n", " elems.append(int(str[dash1+1:dash2]))\n", "\n", " dashT = str.find(\"T\",toFetchI)\n", " toFetchI = dashT + 1\n", " elems.append(int(str[dash2+1:dashT]))\n", "\n", " dashC = str.find(\":\",toFetchI)\n", " toFetchI = dashC + 1\n", " elems.append(int(str[dashT+1:dashC]))\n", "\n", " dashC2 = str.find(\":\",toFetchI)\n", " toFetchI = dashC2 + 1\n", " elems.append(int(str[dashC+1:dashC2]))\n", "\n", " dashZ = str.find(\"Z\",toFetchI)\n", " toFetchI = dashZ + 2\n", " elems.append(int(str[dashC2+1:dashZ]))\n", "\n", " elems.append(int(str[toFetchI:]))\n", " return elems\n", "# except:\n", "# print(str)\n", "# return []\n", " " ] }, { "cell_type": "code", "execution_count": 5, "id": "lucky-gossip", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "def validateDate(elems):\n", " if len(elems) == 0:\n", " return False\n", " precision = elems[-1]\n", "# assert precision >= 9\n", " elems = elems[:-1]\n", "# if precision 
== 14: #second\n", "# lastIndex = 6\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "# elif precision == 13: #minute\n", "# lastIndex = 5\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "# elif precision == 12: #hour\n", "# lastIndex = 4\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "# elif precision == 11: #day\n", "# lastIndex = 3\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "# elif precision == 10: #month\n", "# lastIndex = 2\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "# elif precision <= 9: #year\n", "# lastIndex = 1\n", "# status = all([elem !=0 for elem in elems[:lastIndex]]) and all([elem ==0 for elem in elems[lastIndex:]])\n", "    # A month or day of 0 means the field is unspecified; treat it as 1 for validation\n", "    if elems[1] == 0: elems[1] = 1\n", "    if elems[2] == 0: elems[2] = 1\n", "\n", "    # Years outside 1970-9999 are replaced by a surrogate year with the same\n", "    # leap-year status (1972 leap, 1970 non-leap), so only month/day validity is tested\n", "    if elems[0] < 1970 or elems[0] > 9999:\n", "        if elems[0] % 400 == 0 or (elems[0] % 4 == 0 and elems[0] % 100 != 0):\n", "            elems[0] = 1972\n", "        else:\n", "            elems[0] = 1970\n", "    if precision < 0 or precision > 14:\n", "        return False\n", "    try:\n", "        datetime.datetime(*elems)\n", "        return True\n", "    except (ValueError, OverflowError):\n", "        return False" ] }, { "cell_type": "code", "execution_count": 6, "id": "executed-theater", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "validateDate(parseDate(\"1887-00-00T00:00:00Z/9\"))" ] }, { "cell_type": "code", "execution_count": 7, "id": "enormous-carpet", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "datetime.datetime(1948, 2, 29, 0, 0, 0, 11)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"datetime.datetime(*[1948, 2, 29, 0, 0, 0, 11])" ] }, { "cell_type": "code", "execution_count": 8, "id": "complete-index", "metadata": {}, "outputs": [], "source": [ "date_df['parsed_date'] = date_df['node2'].apply(lambda x: parseDate(x[1:]))\n", "date_df['parsed_date2'] = date_df['newNode2'].apply(lambda x: parseDate(x[1:]))\n", "date_df['valid_date'] = date_df['node2'].apply(lambda x: validateDate(parseDate(x[1:])))\n", "date_df['same_date'] = date_df.apply(lambda p: p.parsed_date == p.parsed_date2, axis=1)\n", "date_df['str_same_date'] = date_df.apply(lambda p: p.node2 == p.newNode2, axis=1)" ] }, { "cell_type": "code", "execution_count": 9, "id": "surface-warehouse", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4711733" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(date_df)" ] }, { "cell_type": "code", "execution_count": 10, "id": "diagnostic-satellite", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th><th>parsed_date</th><th>parsed_date2</th><th>valid_date</th><th>same_date</th><th>str_same_date</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>5950</th><td>Q1285220</td><td>P570</td><td>^00000001979-02-29T00:00:00Z/11</td><td>P570</td><td>^1979-03-29T00:00:00Z/11</td><td>[1979, 2, 29, 0, 0, 0, 11]</td><td>[1979, 3, 29, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>5973</th><td>Q165823</td><td>P569</td><td>^00000001900-02-29T00:00:00Z/11</td><td>P569</td><td>^1900-03-13T00:00:00Z/11</td><td>[1900, 2, 29, 0, 0, 0, 11]</td><td>[1900, 3, 13, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>6073</th><td>Q481471</td><td>P569</td><td>^00000001762-02-29T00:00:00Z/11</td><td>P569</td><td>^1762-02-28T00:00:00Z/11</td><td>[1762, 2, 29, 0, 0, 0, 11]</td><td>[1762, 2, 28, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>6233</th><td>Q16097212</td><td>P569</td><td>^00000001935-06-31T00:00:00Z/11</td><td>P569</td><td>^1935-01-01T00:00:00Z/9</td><td>[1935, 6, 31, 0, 0, 0, 11]</td><td>[1935, 1, 1, 0, 0, 0, 9]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>61707</th><td>Q10717720</td><td>P576</td><td>^00000001995-06-31T00:00:00Z/11</td><td>P576</td><td>^1995-06-31T00:00:00Z/11</td><td>[1995, 6, 31, 0, 0, 0, 11]</td><td>[1995, 6, 31, 0, 0, 0, 11]</td><td>False</td><td>True</td><td>False</td></tr>\n", " <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", " <tr><th>4653389</th><td>Q27267640</td><td>P569</td><td>^1989-02-29T00:00:00Z/11</td><td>P569</td><td>^1989-02-00T00:00:00Z/10</td><td>[1989, 2, 29, 0, 0, 0, 11]</td><td>[1989, 2, 0, 0, 0, 0, 10]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>4674014</th><td>Q2379398</td><td>P569</td><td>^1518-04-31T00:00:00Z/11</td><td>P569</td><td>^1518-05-01T00:00:00Z/11</td><td>[1518, 4, 31, 0, 0, 0, 11]</td><td>[1518, 5, 1, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>4674015</th><td>Q2379398</td><td>P569</td><td>^1518-04-31T00:00:00Z/11</td><td>P569</td><td>^1518-00-00T00:00:00Z/9</td><td>[1518, 4, 31, 0, 0, 0, 11]</td><td>[1518, 0, 0, 0, 0, 0, 9]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>4679134</th><td>Q10932215</td><td>P569</td><td>^1938-02-30T00:00:00Z/11</td><td>P569</td><td>^1938-02-00T00:00:00Z/10</td><td>[1938, 2, 30, 0, 0, 0, 11]</td><td>[1938, 2, 0, 0, 0, 0, 10]</td><td>False</td><td>False</td><td>False</td></tr>\n", " <tr><th>4684514</th><td>Q6447447</td><td>P570</td><td>^1875-02-29T00:00:00Z/11</td><td>P570</td><td>^1875-02-05T00:00:00Z/11</td><td>[1875, 2, 29, 0, 0, 0, 11]</td><td>[1875, 2, 5, 0, 0, 0, 11]</td><td>False</td><td>False</td><td>False</td></tr>\n", " </tbody>\n", "</table>\n", "<p>186 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " node1 label node2 newNode2Label \\\n", "5950 Q1285220 P570 ^00000001979-02-29T00:00:00Z/11 P570 \n", "5973 Q165823 P569 ^00000001900-02-29T00:00:00Z/11 P569 \n", "6073 Q481471 P569 ^00000001762-02-29T00:00:00Z/11 P569 \n", "6233 Q16097212 P569 ^00000001935-06-31T00:00:00Z/11 P569 \n", "61707 Q10717720 P576 ^00000001995-06-31T00:00:00Z/11 P576 \n", "... ... ... ... ... \n", "4653389 Q27267640 P569 ^1989-02-29T00:00:00Z/11 P569 \n", "4674014 Q2379398 P569 ^1518-04-31T00:00:00Z/11 P569 \n", "4674015 Q2379398 P569 ^1518-04-31T00:00:00Z/11 P569 \n", "4679134 Q10932215 P569 ^1938-02-30T00:00:00Z/11 P569 \n", "4684514 Q6447447 P570 ^1875-02-29T00:00:00Z/11 P570 \n", "\n", " newNode2 parsed_date \\\n", "5950 ^1979-03-29T00:00:00Z/11 [1979, 2, 29, 0, 0, 0, 11] \n", "5973 ^1900-03-13T00:00:00Z/11 [1900, 2, 29, 0, 0, 0, 11] \n", "6073 ^1762-02-28T00:00:00Z/11 [1762, 2, 29, 0, 0, 0, 11] \n", "6233 ^1935-01-01T00:00:00Z/9 [1935, 6, 31, 0, 0, 0, 11] \n", "61707 ^1995-06-31T00:00:00Z/11 [1995, 6, 31, 0, 0, 0, 11] \n", "... ... ... \n", "4653389 ^1989-02-00T00:00:00Z/10 [1989, 2, 29, 0, 0, 0, 11] \n", "4674014 ^1518-05-01T00:00:00Z/11 [1518, 4, 31, 0, 0, 0, 11] \n", "4674015 ^1518-00-00T00:00:00Z/9 [1518, 4, 31, 0, 0, 0, 11] \n", "4679134 ^1938-02-00T00:00:00Z/10 [1938, 2, 30, 0, 0, 0, 11] \n", "4684514 ^1875-02-05T00:00:00Z/11 [1875, 2, 29, 0, 0, 0, 11] \n", "\n", " parsed_date2 valid_date same_date str_same_date \n", "5950 [1979, 3, 29, 0, 0, 0, 11] False False False \n", "5973 [1900, 3, 13, 0, 0, 0, 11] False False False \n", "6073 [1762, 2, 28, 0, 0, 0, 11] False False False \n", "6233 [1935, 1, 1, 0, 0, 0, 9] False False False \n", "61707 [1995, 6, 31, 0, 0, 0, 11] False True False \n", "... ... ... ... ... 
\n", "4653389 [1989, 2, 0, 0, 0, 0, 10] False False False \n", "4674014 [1518, 5, 1, 0, 0, 0, 11] False False False \n", "4674015 [1518, 0, 0, 0, 0, 0, 9] False False False \n", "4679134 [1938, 2, 0, 0, 0, 0, 10] False False False \n", "4684514 [1875, 2, 5, 0, 0, 0, 11] False False False \n", "\n", "[186 rows x 10 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_df[date_df['valid_date'] == False]" ] }, { "cell_type": "code", "execution_count": 11, "id": "seventh-sister", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>node1</th><th>label</th><th>node2</th><th>newNode2Label</th><th>newNode2</th><th>parsed_date</th><th>parsed_date2</th><th>valid_date</th><th>same_date</th><th>str_same_date</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>116</th><td>Q12260242</td><td>P569</td><td>^00000001964-00-00T00:00:00Z/9</td><td>P569</td><td>^1964-00-00T00:00:00Z/9</td><td>[1964, 0, 0, 0, 0, 0, 9]</td><td>[1964, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n", " <tr><th>134</th><td>Q12352405</td><td>P569</td><td>^00000001987-03-19T00:00:00Z/11</td><td>P569</td><td>^1987-03-19T00:00:00Z/11</td><td>[1987, 3, 19, 0, 0, 0, 11]</td><td>[1987, 3, 19, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>False</td></tr>\n", " <tr><th>273</th><td>Q16506839</td><td>P569</td><td>^00000001718-01-01T00:00:00Z/9</td><td>P569</td><td>^1718-01-01T00:00:00Z/9</td><td>[1718, 1, 1, 0, 0, 0, 9]</td><td>[1718, 1, 1, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n", " <tr><th>291</th><td>Q1686296</td><td>P571</td><td>^00000002013-01-01T00:00:00Z/11</td><td>P571</td><td>^2013-01-01T00:00:00Z/11</td><td>[2013, 1, 1, 0, 0, 0, 11]</td><td>[2013, 1, 1, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>False</td></tr>\n", " <tr><th>390</th><td>Q258257</td><td>P569</td><td>^00000001140-00-00T00:00:00Z/9</td><td>P569</td><td>^1140-00-00T00:00:00Z/9</td><td>[1140, 0, 0, 0, 0, 0, 9]</td><td>[1140, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>False</td></tr>\n", " <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", " <tr><th>4711728</th><td>Q99767269</td><td>P569</td><td>^1980-06-11T00:00:00Z/11</td><td>P569</td><td>^1980-06-11T00:00:00Z/11</td><td>[1980, 6, 11, 0, 0, 0, 11]</td><td>[1980, 6, 11, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n", " <tr><th>4711729</th><td>Q99824424</td><td>P569</td><td>^1998-02-10T00:00:00Z/11</td><td>P569</td><td>^1998-02-10T00:00:00Z/11</td><td>[1998, 2, 10, 0, 0, 0, 11]</td><td>[1998, 2, 10, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n", " <tr><th>4711730</th><td>Q99858723</td><td>P570</td><td>^1908-01-01T00:00:00Z/9</td><td>P570</td><td>^1908-01-01T00:00:00Z/9</td><td>[1908, 1, 1, 0, 0, 0, 9]</td><td>[1908, 1, 1, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>True</td></tr>\n", " <tr><th>4711731</th><td>Q99859256</td><td>P569</td><td>^1976-12-03T00:00:00Z/11</td><td>P569</td><td>^1976-12-03T00:00:00Z/11</td><td>[1976, 12, 3, 0, 0, 0, 11]</td><td>[1976, 12, 3, 0, 0, 0, 11]</td><td>True</td><td>True</td><td>True</td></tr>\n", " <tr><th>4711732</th><td>Q99945100</td><td>P571</td><td>^2015-00-00T00:00:00Z/9</td><td>P571</td><td>^2015-00-00T00:00:00Z/9</td><td>[2015, 0, 0, 0, 0, 0, 9]</td><td>[2015, 0, 0, 0, 0, 0, 9]</td><td>True</td><td>True</td><td>True</td></tr>\n", " </tbody>\n", "</table>\n", "<p>2912668 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " node1 label node2 newNode2Label \\\n", "116 Q12260242 P569 ^00000001964-00-00T00:00:00Z/9 P569 \n", "134 Q12352405 P569 ^00000001987-03-19T00:00:00Z/11 P569 \n", "273 Q16506839 P569 ^00000001718-01-01T00:00:00Z/9 P569 \n", "291 Q1686296 P571 ^00000002013-01-01T00:00:00Z/11 P571 \n", "390 Q258257 P569 ^00000001140-00-00T00:00:00Z/9 P569 \n", "... ... ... ... ... \n", "4711728 Q99767269 P569 ^1980-06-11T00:00:00Z/11 P569 \n", "4711729 Q99824424 P569 ^1998-02-10T00:00:00Z/11 P569 \n", "4711730 Q99858723 P570 ^1908-01-01T00:00:00Z/9 P570 \n", "4711731 Q99859256 P569 ^1976-12-03T00:00:00Z/11 P569 \n", "4711732 Q99945100 P571 ^2015-00-00T00:00:00Z/9 P571 \n", "\n", " newNode2 parsed_date \\\n", "116 ^1964-00-00T00:00:00Z/9 [1964, 0, 0, 0, 0, 0, 9] \n", "134 ^1987-03-19T00:00:00Z/11 [1987, 3, 19, 0, 0, 0, 11] \n", "273 ^1718-01-01T00:00:00Z/9 [1718, 1, 1, 0, 0, 0, 9] \n", "291 ^2013-01-01T00:00:00Z/11 [2013, 1, 1, 0, 0, 0, 11] \n", "390 ^1140-00-00T00:00:00Z/9 [1140, 0, 0, 0, 0, 0, 9] \n", "... ... ... \n", "4711728 ^1980-06-11T00:00:00Z/11 [1980, 6, 11, 0, 0, 0, 11] \n", "4711729 ^1998-02-10T00:00:00Z/11 [1998, 2, 10, 0, 0, 0, 11] \n", "4711730 ^1908-01-01T00:00:00Z/9 [1908, 1, 1, 0, 0, 0, 9] \n", "4711731 ^1976-12-03T00:00:00Z/11 [1976, 12, 3, 0, 0, 0, 11] \n", "4711732 ^2015-00-00T00:00:00Z/9 [2015, 0, 0, 0, 0, 0, 9] \n", "\n", " parsed_date2 valid_date same_date str_same_date \n", "116 [1964, 0, 0, 0, 0, 0, 9] True True False \n", "134 [1987, 3, 19, 0, 0, 0, 11] True True False \n", "273 [1718, 1, 1, 0, 0, 0, 9] True True False \n", "291 [2013, 1, 1, 0, 0, 0, 11] True True False \n", "390 [1140, 0, 0, 0, 0, 0, 9] True True False \n", "... ... ... ... ... 
\n", "4711728 [1980, 6, 11, 0, 0, 0, 11] True True True \n", "4711729 [1998, 2, 10, 0, 0, 0, 11] True True True \n", "4711730 [1908, 1, 1, 0, 0, 0, 9] True True True \n", "4711731 [1976, 12, 3, 0, 0, 0, 11] True True True \n", "4711732 [2015, 0, 0, 0, 0, 0, 9] True True True \n", "\n", "[2912668 rows x 10 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_df[date_df['same_date']]" ] }, { "cell_type": "code", "execution_count": 12, "id": "failing-mileage", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No. of deleted statements having exact same date in dataset as of 7th December 2020: 30262\n" ] } ], "source": [ "print(f\"No. of deleted statements having exact same date in dataset as of 7th December 2020: {sum(date_df['str_same_date'])}\")" ] }, { "cell_type": "code", "execution_count": 28, "id": "clean-canon", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "def customTimeDelta(date1, date2):\n", "    # Return date1 - date2 as a timedelta, or None when either date is missing\n", "    # or cannot be represented as a datetime (month or day 0, year out of range).\n", "    if date1 is None or date2 is None:\n", "        return None\n", "    try:\n", "        # The last element of a parsed date is its precision, not a date field.\n", "        return datetime.datetime(*date1[:-1]) - datetime.datetime(*date2[:-1])\n", "    except (OverflowError, TypeError, ValueError):\n", "        return None" ] }, { "cell_type": "code", "execution_count": 29, "id": "waiting-thumbnail", "metadata": {}, "outputs": [], "source": [ "# Copy the slice so that adding a column later does not trigger SettingWithCopyWarning\n", "date_df1 = date_df[(date_df['valid_date'] == True) & (date_df['same_date'] == False)].copy()" ] }, { "cell_type": "code", "execution_count": 30, "id": "superior-gothic", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ ":1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See 
the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)\n" ] } ], "source": [ "date_df1['time_delta'] = date_df1.apply(lambda x: customTimeDelta(x.parsed_date, x.parsed_date2), axis=1)" ] }, { "cell_type": "code", "execution_count": 32, "id": "muslim-stephen", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 None\n", "1 None\n", "2 None\n", "3 None\n", "4 None\n", " ... \n", "4711659 None\n", "4711682 None\n", "4711690 None\n", "4711700 None\n", "4711703 None\n", "Name: time_delta, Length: 1798925, dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "date_df1['time_delta']" ] }, { "cell_type": "code", "execution_count": null, "id": "dutch-projection", "metadata": {}, "outputs": [], "source": [ "# !head ../../opAnalysis/removed_statements_both_nonredirects_new_vals_date_measured.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "prepared-magnet", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "relative-tomorrow", "metadata": {}, "source": [ "### Numeric Values Comparison" ] }, { "cell_type": "code", "execution_count": null, "id": "revolutionary-mistake", "metadata": {}, "outputs": [], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/metadata.property.datatypes.tsv.gz \\\n", " --match \"non: (x)-[r{label: property}]->(y), datatypes: (property)-[]->(:quantity)\" \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "eight-haven", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4323460 ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv\r\n" ] } ], 
"source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv" ] }, { "cell_type": "code", "execution_count": 2, "id": "unknown-nirvana", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-09 15:19:10 sqlstore]: IMPORT graph directly into table graph_71 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_qty.tsv ...\n", "[2021-04-09 15:19:30 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_71_c1.\"node1\", graph_71_c1.\"label\", graph_71_c1.\"node2\", graph_51_c2.\"label\" \"_aLias.node2;newLabel\", graph_51_c2.\"node2\" \"_aLias.node2;newVal\"\n", " FROM graph_51 AS graph_51_c2, graph_71 AS graph_71_c1\n", " WHERE graph_51_c2.\"node1\"=graph_71_c1.\"node1\"\n", " AND (graph_71_c1.\"label\" = graph_51_c2.\"label\")\n", " PARAS: []\n", "---------------------------------------------\n", "[2021-04-09 15:19:30 sqlstore]: CREATE INDEX on table graph_71 column node1 ...\n", "[2021-04-09 15:19:32 sqlstore]: ANALYZE INDEX on table graph_71 column node1 ...\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_qty.tsv \\\n", " ../../gdrive-kgtk-dump-2020-12-07/claims.quantity.tsv.gz \\\n", " --match \"non: (x)-[r]->(y), quantity: (x)-[s]->(z)\" \\\n", " --where \"r.label = s.label\" \\\n", " --return 'x, r.label, y, s.label as `node2;newLabel`, z as `node2;newVal`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "convertible-softball", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3239699 ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_num.tsv" ] }, { "cell_type": "code", "execution_count": 61, "id": "unlikely-overhead", "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "historical-copying", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-09 15:26:38 sqlstore]: IMPORT graph directly into table graph_72 from /data/wd-correctness/opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv ...\n", "[2021-04-09 15:29:43 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", graph_72_c1.\"node2;newLabel\" \"_aLias.node2;newLabel\", max(graph_72_c1.\"node2;newVal\") \"_aLias.node2;newValue\", count(graph_72_c1.\"node2;newVal\") \"_aLias.node2;branching\"\n", " FROM graph_72 AS graph_72_c1\n", " WHERE graph_72_c1.\"node2;newLabel\"=graph_72_c1.\"node2;newLabel\"\n", " AND graph_72_c1.\"node2;newVal\"=graph_72_c1.\"node2;newVal\"\n", " GROUP BY graph_72_c1.\"node1\", graph_72_c1.\"label\", graph_72_c1.\"node2\", \"_aLias.node2;newLabel\"\n", " PARAS: []\n", "---------------------------------------------\n" ] } ], "source": [ "!kgtk --debug query -i ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone2.tsv \\\n", " --match \"(node1)-[r]->(node2{newLabel: newLabel, newVal: newValue})\" \\\n", " --return 'node1, r.label, node2, newLabel as `node2;newLabel`, max(newValue) as `node2;newValue`, count(newValue) as `node2;branching`' \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv" ] }, { "cell_type": "code", "execution_count": 4, "id": "waiting-citizenship", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df1 = 
pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 5, "id": "unlike-huntington", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 node2;newLabel \\\n", "2501639 Q999961 P1082 +17243[+17243,+17243] P1082 \n", "2501640 Q999961 P1082 +6925 P1082 \n", "2501641 Q999961 P1082 +8653[+8653,+8653] P1082 \n", "2501642 Q999961 P2046 +23.95Q712226 P2046 \n", "2501643 Q999988 P2046 +1000[+1000,+1000]Q81292 P2046 \n", "\n", " node2;newValue node2;branching \n", "2501639 +8883 27 \n", "2501640 +8883 27 \n", "2501641 +8883 27 \n", "2501642 +23.952616Q712226 1 \n", "2501643 +1000Q81292 1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.tail()" ] }, { "cell_type": "code", "execution_count": 6, "id": "confident-carolina", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "node1\tlabel\tnode2\tnode2;newLabel\tnode2;newValue\tnode2;branching\r\n", "P1733\tP4876\t+1014280\tP4876\t+28977\t1\r\n", "P2040\tP4876\t+34596\tP4876\t+38623\t1\r\n", "P2349\tP4876\t+12367\tP4876\t+12500\t3\r\n", "P2427\tP4876\t+95000\tP4876\t+96793\t4\r\n", "P2518\tP4876\t+11126\tP4876\t+11145\t1\r\n", "P2725\tP4876\t+2232\tP4876\t+3907\t1\r\n", "P2816\tP4876\t+32155\tP4876\t+34149\t2\r\n", "P3289\tP4876\t+113576\tP4876\t+123199\t1\r\n", "P3400\tP4876\t+123817\tP4876\t+123817\t4\r\n" ] } ], "source": [ "!head ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "adjusted-discretion", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7\n" ] } ], "source": [ "import re\n", "test_str = \"+123817Q\"\n", "temp = re.search(r'[a-z]', test_str, re.I)\n", "if temp is not None:\n", " print(temp.start())\n", "else:\n", " print(\"Not found\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "toxic-heart", "metadata": {}, "outputs": [], "source": [ "def splitIntoParts(text):\n", " temp = re.search(r'[a-z]', text, re.I)\n", " firstAlpha1 = -1 if temp is None else temp.start()\n", " alpha1 = \"\" if 
firstAlpha1 == -1 else text[firstAlpha1:]\n", " text = text if firstAlpha1 == -1 else text[:firstAlpha1]\n", " \n", " temp = re.search(r'\\[', text, re.I)\n", " firstBracket1 = -1 if temp is None else temp.start()\n", " brack1 = \"\" if firstBracket1 == -1 else text[firstBracket1:]\n", " \n", " num1 = text if firstBracket1 == -1 else text[:firstBracket1]\n", " \n", " return num1, brack1, alpha1" ] }, { "cell_type": "code", "execution_count": 9, "id": "impressed-monthly", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('+1234', '[+1, -1]', 'Q12345')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splitIntoParts(\"+1234[+1, -1]Q12345\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "sunset-fraction", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c86b1765daec4bc084f0c0f399a69dfd", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2501645 [00:00\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtqdm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0mval1\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0mval2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"\\t\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mIndexError\u001b[0m: list index out of range" ] } ], "source": [ "from dateutil.parser import parse\n", "import re\n", "import rltk\n", "from rltk.similarity import levenshtein_distance as ld\n", "from nltk.tokenize import word_tokenize as wt\n", "from tqdm.notebook import tqdm\n", "\n", "def is_num(string):\n", "    try:\n", "        float(string)\n", "        return True\n", "    except ValueError:\n", "        return False\n", "\n", "f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated2.tsv\",\"r\").read().split(\"\\n\")\n", "fNum = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",\"w\")\n", "firstLine = f1[0]\n", "\n", "fNum.write(firstLine+\"\\tNumNE\\tRangeNE\\tNumNRangeNE\\tUnitNE\\n\")\n", "\n", "for i in tqdm(range(1,len(f1))):\n", "    line = f1[i]\n", "    # Skip blank lines (e.g. the empty string left after the file's trailing newline),\n", "    # which otherwise raise IndexError when the split fields are indexed.\n", "    if not line.strip():\n", "        continue\n", "    fields = line.split(\"\\t\")\n", "    val1 = fields[2]  # removed value (node2)\n", "    val2 = fields[4]  # current value (node2;newValue)\n", "\n", "    num1, brack1, alpha1 = splitIntoParts(val1)\n", "    num2, brack2, alpha2 = splitIntoParts(val2)\n", "\n", "    fNum.write(line + \"\\t\" + str(num1 != num2) + \"\\t\" + str(brack1 != brack2) + \"\\t\" + str((num1 != num2) and (brack1 != brack2)) + \"\\t\" + str(alpha1 != alpha2) + 
\"\\n\")\n", "\n", "fNum.close()" ] }, { "cell_type": "code", "execution_count": null, "id": "continued-landscape", "metadata": {}, "outputs": [], "source": [ "# from dateutil.parser import parse\n", "# import re\n", "# import rltk\n", "# from rltk.similarity import levenshtein_distance as ld\n", "# from nltk.tokenize import word_tokenize as wt\n", "# from tqdm.notebook import tqdm\n", "\n", "# def is_num(string):\n", "# try: \n", "# float(string)\n", "# return True\n", "\n", "# except ValueError:\n", "# return False\n", " \n", "# f1 = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals.tsv\",\"r\").read().split(\"\\n\")\n", "# fNum = open(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_measured.tsv\",\"w\")\n", "\n", "# firstLine = f1[0]\n", "\n", "# fNum.write(firstLine+\"\\tDiff\\tLevDist\\n\")\n", "# # fnonQnd.write(f1[0]+\"\\n\")\n", "\n", "# for i in tqdm(range(1,len(f1))):\n", "# line = f1[i]\n", "# val1 = line.split(\"\\t\")[2]\n", "# val2 = line.split(\"\\t\")[4]\n", "# if is_num(val2):\n", "# diff = float(val2) - float(val1)\n", "# fNum.write(line+ \"\\t\" + str(diff) + \"\\tNone\\n\")\n", "# else:\n", "# LevDist = ld(val1,val2)\n", "# fNum.write(line+ \"\\tNone\\t\" + str(LevDist) + \"\\n\")\n", "\n", "# fNum.close()" ] }, { "cell_type": "code", "execution_count": 11, "id": "impaired-venue", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "num_df = pd.read_csv(\"../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_truncated_measured2.tsv\",sep='\\t')" ] }, { "cell_type": "code", "execution_count": 12, "id": "strange-alcohol", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " node1 label node2 node2;newLabel node2;newValue node2;branching \\\n", "0 P1733 P4876 +1014280 P4876 +28977 1 \n", "1 P2040 P4876 +34596 P4876 +38623 1 \n", "2 P2349 P4876 +12367 P4876 +12500 3 \n", "3 P2427 P4876 +95000 P4876 +96793 4 \n", "4 P2518 P4876 +11126 P4876 +11145 1 \n", "\n", " NumNE RangeNE NumNRangeNE UnitNE \n", "0 True False False False \n", "1 True False False False \n", "2 True False False False \n", "3 True False False False \n", "4 True False False False " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "id": "hindu-merit", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "168439415 ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_num_new_vals_rightone1.tsv" ] }, { "cell_type": "code", "execution_count": 14, "id": "hollywood-boring", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2.501575e+06\n", "mean 6.733284e+01\n", "std 5.003042e+02\n", "min 1.000000e+00\n", "25% 1.000000e+00\n", "50% 2.000000e+00\n", "75% 1.100000e+01\n", "max 2.132100e+04\n", "Name: node2;branching, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_df['node2;branching'].describe()" ] }, { "cell_type": "code", "execution_count": 15, "id": "moral-history", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of 2501575 quantities, there are 1496454 cases where numbers have got updated, 2037283 cases where ranges have got updated, 1069289 cases where number and range both have got updated, 78048 cases were the unit has got updated\n" ] } ], "source": [ "print(f\"Out of {len(num_df)} quantities, there are {num_df['NumNE'].sum()} cases where numbers have got updated, 
{num_df['RangeNE'].sum()} cases where ranges have got updated, {num_df['NumNRangeNE'].sum()} cases where number and range both have got updated, {num_df['UnitNE'].sum()} cases where the unit has got updated\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "assured-recipient", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "^C\r\n" ] } ], "source": [ "# !kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects.tsv \\\n", "# ../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n", "# --match \"r: (x)-[r]->(y), c: (x)-[s]->(z)\" \\\n", "# --where \"r.label = s.label\" \\\n", "# --return 'x, r.label, y, s.label as node2;newLabl, z as node2;nw' \\\n", "# -o ../../opAnalysis/removed_statements_both_nonredirects_new_vals.tsv" ] }, { "cell_type": "markdown", "id": "muslim-dryer", "metadata": {}, "source": [ "### Qnodes comparison" ] }, { "cell_type": "markdown", "id": "brilliant-picnic", "metadata": {}, "source": [ "#### Qnodes type segregation\n", "\n", "Here, for each removed Qnode-to-Qnode statement, we count:\n", "* How many statements have a node1 that is an instance (P31), a subclass (P279), or both, of something else\n", "* How many statements have a node2 that is an instance (P31), a subclass (P279), or both, of something else" ] }, { "cell_type": "code", "execution_count": null, "id": "described-america", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "universal-surprise", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " 
--filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv" ] }, { "cell_type": "code", "execution_count": 60, "id": "elder-tissue", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "killing-emphasis", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "answering-sheriff", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv" ] }, { "cell_type": "code", "execution_count": 61, "id": "intimate-sullivan", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node2 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv" ] }, { "cell_type": "code", 
"execution_count": 57, "id": "surprising-clone", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15682364 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv" ] }, { "cell_type": "code", "execution_count": 62, "id": "innovative-thread", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3500869 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P279.tsv\n", " 3396316 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31andP279.tsv\n", " 14206459 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1.P31.tsv\n", " 21103644 total\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode1*" ] }, { "cell_type": "code", "execution_count": 63, "id": "accompanied-lighting", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 10064419 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P279.tsv\n", " 6622159 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31andP279.tsv\n", " 12057758 ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2.P31.tsv\n", " 28744336 total\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node_qnode2*" ] }, { "cell_type": "markdown", "id": "verified-vegetable", "metadata": {}, "source": [ "#### Qnodes to Qnodes (instance/subclass analysis)\n", "\n", "Here, we analyze how many P31 relations were deleted, how many were updated to P31/P279/nothing. 
We do the same thing for P279 relations that were deleted" ] }, { "cell_type": "code", "execution_count": null, "id": "quick-welsh", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --return 'count(a)' \\\n", " --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P31.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "satisfied-philosophy", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P31]->(b)' \\\n", " --graph-cache ~/sqlite3_caches/db1.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "southern-daisy", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P279]->(b)' \\\n", " --return 'count(a)' \\\n", " --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_count_P279.tsv" ] }, { "cell_type": "code", "execution_count": 1, "id": "subtle-tract", "metadata": {}, "outputs": [], "source": [ "!kgtk query -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode.tsv \\\n", " --match 'o: (a)-[:P279]->(b)' \\\n", " --graph-cache ~/sqlite3_caches/db2.sqlite3.db \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv" ] }, { "cell_type": "markdown", "id": "opponent-bible", "metadata": {}, "source": [ "##### Analyze for P31 relations" ] }, { "cell_type": "code", "execution_count": 4, "id": "soviet-liverpool", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv" ] }, { "cell_type": "code", "execution_count": 5, "id": "imposed-pound", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv" ] }, { "cell_type": "code", "execution_count": 16, "id": "provincial-limit", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv" ] }, { "cell_type": "code", "execution_count": 6, "id": "dynamic-persian", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv \\\n", " ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv" ] }, { "cell_type": "code", "execution_count": 7, "id": "material-routine", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv \\\n", " --filter-on 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31orP279.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 18, "id": "aboriginal-injection", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3611396 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n", "2864334 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n", "150123 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n", "106540 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n", "703480 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_newP31andP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": null, "id": "perceived-hopkins", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n", " --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv" ] }, 
{ "cell_type": "code", "execution_count": 1, "id": "antique-neighborhood", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv" ] }, { "cell_type": "code", "execution_count": 2, "id": "alleged-destiny", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 626925 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_deleted.tsv\r\n", " 76556 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_existing.tsv\r\n", " 703481 total\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P31_extract_nothingnew_*" ] }, { "cell_type": "markdown", "id": "opposed-palmer", "metadata": {}, "source": [ "##### Analyze for P279 relations" ] }, { "cell_type": "code", "execution_count": 8, "id": "hybrid-hacker", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P31.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv" ] }, { "cell_type": "code", "execution_count": 9, "id": "reliable-ontario", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv" ] }, { "cell_type": "code", "execution_count": 17, "id": "radio-bumper", "metadata": {}, "outputs": [], "source": [ "!kgtk ifexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n", " --filter-on ../../wikidata-20210215/derived.P279.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv" ] }, { "cell_type": "code", "execution_count": 10, "id": "loving-switzerland", "metadata": {}, "outputs": [], "source": [ "!kgtk cat -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv \\\n", " ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv" ] }, { "cell_type": "code", "execution_count": 11, "id": "prostate-trace", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31orP279.tsv \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 19, "id": "subsequent-recovery", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "935667 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n", "865917 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n", "454917 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n", "421734 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n", "36568 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_newP31andP279.tsv\n", "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "hazardous-liberal", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "^C\r\n", "\r\n", "Keyboard interrupt in ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz --filter-mode NONE --input-keys node1 --filter-keys node1 -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv.\r\n" ] } ], "source": [ "!kgtk ifnotexists -i ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n", " --filter-on ../../gdrive-kgtk-dump-2020-12-07/claims.tsv.gz \\\n", " --filter-mode NONE \\\n", " --input-keys node1 \\\n", " --filter-keys node1 \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "manual-embassy", "metadata": {}, "outputs": [], "source": [ "!kgtk ifnotexists -i 
../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew.tsv \\\n", " --filter-on ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv \\\n", " -o ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv" ] }, { "cell_type": "code", "execution_count": 2, "id": "determined-wonder", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 35004 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_deleted.tsv\r\n", " 1565 ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_existing.tsv\r\n", " 36569 total\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/removed_statements_both_nonredirects_node2_qnode_P279_extract_nothingnew_*" ] }, { "cell_type": "code", "execution_count": 5, "id": "hundred-equivalent", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q12328016-P31-Q12737077-46763b70-0\tQ12328016\tP31\tQ12737077\r\n" ] } ], "source": [ "!zgrep -P \"Q12328016\\tP31\" ../../wikidata-20210215/derived.P31.tsv.gz" ] }, { "cell_type": "markdown", "id": "cordless-better", "metadata": {}, "source": [ "# Deprecated Statements Analysis" ] }, { "cell_type": "code", "execution_count": 2, "id": "canadian-broadcast", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-14 17:58:03 sqlstore]: IMPORT graph directly into table graph_75 from /data/wd-correctness/data/deprecated.tsv ...\n", "[2021-04-14 17:58:36 query]: SQL Translation:\n", "---------------------------------------------\n", " SELECT *\n", " FROM graph_75 AS graph_75_c1\n", " WHERE (graph_75_c1.\"label\" IN (?))\n", " PARAS: ['P31']\n", "---------------------------------------------\n" ] } ], "source": [ "!kgtk --debug query -i ../../data/deprecated.tsv \\\n", " --match '(node1)-[prop]->(node2)' \\\n", " --where 
'prop.label in [\"P31\"]' \\\n", " -o ../../opAnalysis/deprecated_P31.tsv" ] }, { "cell_type": "code", "execution_count": 3, "id": "blank-capital", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3303205 ../../opAnalysis/deprecated_P31.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/deprecated_P31.tsv" ] }, { "cell_type": "code", "execution_count": 10, "id": "unique-stevens", "metadata": {}, "outputs": [], "source": [ "# Load the deprecated P31 (instance of) statements extracted above.\n", "import pandas as pd\n", "dep_P31_df = pd.read_csv(\"../../opAnalysis/deprecated_P31.tsv\", sep='\\t')" ] }, { "cell_type": "code", "execution_count": 11, "id": "alternate-snowboard", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Q67206691 2546256\n", "Q523 352194\n", "Q67206785 60055\n", "Q1931185 43618\n", "Q318 35768\n", "Q2247863 21906\n", "Q13890 17533\n", "Q46587 16574\n", "Q6243 13070\n", "Q2154519 12184\n", "Q1153690 10092\n", "Q83373 9998\n", "Q72802727 9948\n", "Q1491746 9106\n", "Q71798532 7641\n", "Name: node2, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 15 most frequent node2 values among the deprecated P31 statements.\n", "dep_P31_df['node2'].value_counts().head(15)" ] }, { "cell_type": "code", "execution_count": 4, "id": "coupled-rochester", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2021-04-14 18:00:30 query]: SQL Translation:\r\n", "---------------------------------------------\r\n", " SELECT *\r\n", " FROM graph_75 AS graph_75_c1\r\n", " WHERE (graph_75_c1.\"label\" IN (?))\r\n", " PARAS: ['P279']\r\n", "---------------------------------------------\r\n" ] } ], "source": [ "!kgtk --debug query -i ../../data/deprecated.tsv \\\n", " --match '(node1)-[prop]->(node2)' \\\n", " --where 'prop.label in [\"P279\"]' \\\n", " -o ../../opAnalysis/deprecated_P279.tsv" ] }, { "cell_type": "code", "execution_count": 5, "id": "bibliographic-wayne", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "307 
../../opAnalysis/deprecated_P279.tsv\r\n" ] } ], "source": [ "!wc -l ../../opAnalysis/deprecated_P279.tsv" ] }, { "cell_type": "code", "execution_count": 12, "id": "caring-gossip", "metadata": {}, "outputs": [], "source": [ "# Load the deprecated P279 (subclass of) statements extracted above.\n", "import pandas as pd\n", "dep_P279_df = pd.read_csv(\"../../opAnalysis/deprecated_P279.tsv\", sep='\\t')" ] }, { "cell_type": "code", "execution_count": 13, "id": "saving-competition", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Q14659 11\n", "Q245932 8\n", "Q27825887 7\n", "Q21451942 6\n", "Q1861967 6\n", "Q1457669 4\n", "Q58840094 4\n", "Q3024240 3\n", "Q26772977 3\n", "Q387917 3\n", "Q192089 3\n", "Q276314 3\n", "Q152574 2\n", "Q209363 2\n", "Q7033037 2\n", "Name: node2, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 15 most frequent node2 values among the deprecated P279 statements.\n", "dep_P279_df['node2'].value_counts().head(15)" ] }, { "cell_type": "code", "execution_count": 15, "id": "critical-pendant", "metadata": {}, "outputs": [], "source": [ "# Load the full deprecated statements file. low_memory=False makes pandas infer\n", "# column dtypes from the whole file, avoiding the mixed-type DtypeWarning.\n", "import pandas as pd\n", "dep_df = pd.read_csv(\"../../data/deprecated.tsv\", sep='\\t', low_memory=False)" ] }, { "cell_type": "code", "execution_count": 17, "id": "abstract-disclaimer", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "P31 3303204\n", "P2215 2236125\n", "P2214 2159860\n", "P2216 816191\n", "P2583 461113\n", "P1090 290549\n", "P215 273273\n", "P6879 107265\n", "P7015 66554\n", "P881 55717\n", "Name: label, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 10 most frequent properties among all deprecated statements.\n", "dep_df.label.value_counts().head(10)" ] }, { "cell_type": "markdown", "id": "dramatic-spyware", "metadata": {}, "source": [ "Fin." 
] } ], "metadata": { "kernelspec": { "display_name": "kgtkEnv", "language": "python", "name": "kgtkenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "288px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }