{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

# Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "# Parallel Passages in the MT\n", "\n", "# 0. Introduction\n", "\n", "## 0.1 Motivation\n", "We want to make a list of **all** parallel passages in the Masoretic Text (MT) of the Hebrew Bible.\n", "\n", "Here is a quote that triggered Dirk to write this notebook:\n", "\n", "> Finally, the Old Testament Parallels module in Accordance is a helpful resource that enables the researcher to examine 435 sets of parallel texts, or in some cases very similar wording in different texts, in both the MT and translation, but the large number of sets of texts in this database should not fool one to think it is complete or even nearly complete for all parallel writings in the Hebrew Bible.\n", "\n", "Robert Rezetko and Ian Young.\n", " Historical linguistics & Biblical Hebrew. Steps Toward an Integrated Approach.\n", " *Ancient Near East Monographs, Number 9*. SBL Press Atlanta. 2014.\n", " [PDF Open access available](https://www.google.nl/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCgQFjAB&url=http%3A%2F%2Fwww.sbl-site.org%2Fassets%2Fpdfs%2Fpubs%2F9781628370461_OA.pdf&ei=2QSdVf-vAYSGzAPArJeYCg&usg=AFQjCNFA3TymYlsebQ0MwXq2FmJCSHNUtg&sig2=LaXuAC5k3V7fSXC6ZVx05w&bvm=bv.96952980,d.bGQ)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## 0.3 Open Source\n", "This is an IPython notebook.\n", "It contains a working program to carry out the computations needed to obtain the results reported here.\n", "\n", "You can download this notebook and run it on your computer, provided you have\n", "[Text-Fabric](https://github.com/Dans-labs/text-fabric) installed.\n", "\n", "It is a pity that we cannot compare our results with the Accordance resource mentioned above,\n", "since that resource has not been published in an accessible manner.\n", "We also do not have the information how this resource has been constructed on the basis of the raw data.\n", "In contrast with that, we present our results in a completely reproducible manner.\n", "This notebook itself can serve as the method of replication,\n", "provided you have obtained the necessary resources.\n", "See [sources](https://github.com/ETCBC/shebanq/wiki/Sources), which are all Open Access.\n", "\n", "## 0.4 What are parallel passages?\n", "The notion of *parallel passage* is not a simple, straightforward one.\n", "There are parallels on the basis of lexical content in the passages on the one hand,\n", "but on the other hand there are also correspondences in certain syntactical structures,\n", "or even in similarities in text structure.\n", "\n", "In this notebook we do select a straightforward notion of parallel, based on lexical content only.\n", "We investigate two measures of similarity, one that ignores word order completely,\n", "and one that takes word order into account.\n", "\n", "Two kinds of short-comings of this approach must be mentioned:\n", "\n", "1. We will not find parallels based on non-lexical criteria (unless they are also lexical parallels)\n", "1. We will find too many parallels: certain short sentences (and he said), or formula like passages (and the word of God came to Moses) occur so often that they have a more subtle bearing on whether there is a common text history.\n", "\n", "For a more full treatment of parallel passages, see\n", "\n", "**Willem Th. 
van Peursen and Eep Talstra**:\n", "Computer-Assisted Analysis of Parallel Texts in the Bible -\n", "The Case of 2 Kings xviii-xix and its Parallels in Isaiah and Chronicles.\n", "*Vetus Testamentum* 57, pp. 45-72.\n", "2007, Brill, Leiden.\n", "\n", "Note that our method fails to identify any parallels with `Chronica_II` 32.\n", "Van Peursen and Talstra state about this chapter and 2 Kings 18:\n", "\n", "> These chapters differ so much, that it is sometimes impossible to establish\n", "which verses should be considered parallel.\n", "\n", "In this notebook we produce a set of *cliques*,\n", "a clique being a set of passages that are *quite* similar, based on lexical information.\n", "\n", "\n", "## 0.5 Authors\n", "This notebook is by Dirk Roorda and owes a lot to discussions with Martijn Naaijer.\n", "\n", "[Dirk Roorda](mailto:dirk.roorda@dans.knaw.nl) while discussing ideas with\n", "[Martijn Naaijer](mailto:m.naaijer@vu.nl).\n", "\n", "\n", "## 0.6 Status\n", "\n", "* **modified: 2017-09-28** Is now part of a pipeline for transferring data from the ETCBC to Text-Fabric.\n", "* **modified: 2016-03-03** Added experiments based on chapter chunks and lower similarities.\n", "\n", "165 experiments have been carried out, of which 18 with promising results.\n", "All results can be easily inspected, just by clicking in your browser.\n", "One of the experiments has been chosen as the basis for\n", "[crossref](https://shebanq.ancient-data.org/hebrew/note?version=4b&id=Mnxjcm9zc3JlZg__&tp=txt_tb1&nget=v)\n", "annotations in SHEBANQ.\n", "\n", "# 1. Results\n", "\n", "Click in a green cell to see interesting results. The numbers in the cell indicate\n", "\n", "* the number of passages that have a variant elsewhere\n", "* the number of *cliques* they form (cliques are sets of similar passages)\n", "* the number of passages in the biggest clique\n", "\n", "Below the results is an account of the method that we used, followed by the actual code to produce these results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pipeline\n", "See [operation](https://github.com/ETCBC/pipeline/blob/master/README.md#operation)\n", "for how to run this script in the pipeline.\n", "\n", "The pipeline comes in action in Section [6a](#6a) below: TF features." 
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Caveat\n", "\n", "This notebook makes use of a new feature of text-fabric, first present in 2.3.15.\n", "Make sure to upgrade first.\n", "\n", "```\n", "sudo -H pip3 install --upgrade text-fabric\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "import re\n", "import collections\n", "import pickle\n", "import math\n", "import difflib\n", "import yaml\n", "from difflib import SequenceMatcher\n", "from IPython.display import HTML\n", "import matplotlib.pyplot as plt\n", "from tf.core.helpers import formatMeta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip3 install python-Levenshtein\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from Levenshtein import ratio" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "import utils\n", "from tf.fabric import Fabric" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "get_ipython().run_line_magic(\"load_ext\", \"autoreload\") # noqa F821\n", "get_ipython().run_line_magic(\"autoreload\", \"2\") # noqa F821\n", "get_ipython().run_line_magic(\"matplotlib\", \"inline\") # noqa F821" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[2]:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "if \"SCRIPT\" not in locals():\n", " # SCRIPT = False\n", " SCRIPT = False\n", " FORCE = True\n", " FORCE_MATRIX = False\n", " LANG_FEATURE = \"languageISO\"\n", " OCC_FEATURE = \"g_cons\"\n", " LEX_FEATURE = \"lex\"\n", " TEXT_FEATURE = \"g_word_utf8\"\n", " TRAILER_FEATURE = \"trailer_utf8\"\n", " CORE_NAME = \"bhsa\"\n", " NAME = \"parallels\"\n", " VERSION = \"2021\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def stop(good=False):\n", " if SCRIPT:\n", " sys.exit(0 if good else 1)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[3]:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# run this cell after all other cells\n", "if False and not SCRIPT:\n", " HTML(other_exps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 
Experiments\n", "\n", "We have conducted 165 experiments, all corresponding to a specific choice of parameters.\n", "Every experiment is an attempt to identify variants and collect them in *cliques*.\n", "\n", "The table gives an overview of the experiments conducted.\n", "\n", "Every *row* corresponds to a particular way of chunking and a method of measuring the similarity.\n", "\n", "There are *columns* for each similarity *threshold* that we have tried.\n", "The idea is that chunks are similar if their similarity is above the threshold.\n", "\n", "The outcomes of one experiment have been added to SHEBANQ as the note set\n", "[crossref](https://shebanq.ancient-data.org/hebrew/note?version=4b&id=Mnxjcm9zc3JlZg__&tp=txt_tb1&nget=v).\n", "The experiment chosen for this is currently\n", "\n", "* *chunking*: **object verse**\n", "* *similarity method*: **SET**\n", "* *similarity threshold*: **65**\n", "\n", "\n", "## 2.1 Assessing the outcomes\n", "\n", "Not all experiments lead to useful results.\n", "We have indicated the value of a result by a color coding, based on objective characteristics,\n", "such as the number of parallel passages, the number of cliques, the size of the greatest clique, and the way of chunking.\n", "These numbers are shown in the cells.\n", "\n", "### 2.1.1 Assessment criteria\n", "\n", "If the method is based on *fixed* chunks, we deprecated the method and the results.\n", "Because two perfectly similar verses could be missed if a 100-word wide window that shifts over the text aligns differently with both verses, which will usually be the case.\n", "\n", "Otherwise, we consider the *ll*, the length of the longest clique, and `nc`, the number of cliques.\n", "We set three quality parameters:\n", "* `REC_CLIQUE_RATIO` = 5 : recommended clique ratio\n", "* `DUB_CLIQUE_RATIO` = 15 : dubious clique ratio\n", "* `DEP_CLIQUE_RATIO` = 25 : deprecated clique ratio\n", "\n", "where the *clique ratio* is $100 (ll/nc)$,\n", "i.e. the length of the longest clique divided by the number of cliques as percentage.\n", "\n", "An experiment is *recommended* if its clique ratio is between the recommended and dubious clique ratios.\n", "\n", "It is *dubious* if its clique ratio is between the dubious and deprecated clique ratios.\n", "\n", "It is *deprecated* if its clique ratio is above the deprecated clique ratio.\n", "\n", "# 2.2 Inspecting results\n", "If you click on the hyperlink in the cell, you are taken to a page that gives you\n", "all the details of the results:\n", "\n", "1. A link to a file with all *cliques* (which are the sets of similar passages)\n", "1. A list of links to chapter-by-chapter diff files (for cliques with just two members), and only for\n", " experiments with outcomes that are labeled as *promising* or *unassessed quality* or *mixed results*.\n", "\n", "To get into the variants quickly, inspect the list (2) and click through\n", "to see the actual variant material in chapter context.\n", "\n", "Not all variants occur here, so continue with (1) to see the remaining cliques.\n", "\n", "Sometimes in (2) a chapter diff file does not indicate clearly the relevant common part of both chapters.\n", "In that case you have to consult the big list (1)\n", "\n", "All these results can be downloaded from the\n", "[SHEBANQ GitHub repo](https://github.com/ETCBC/shebanq/tree/master/static/docs/tools/parallel/files)\n", "After downloading the whole directory, open ``experiments.html`` in your browser." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. 
Method\n", "\n", "Here we discuss the method we used to arrive at a list of parallel passages\n", "in the Masoretic Text (MT) of the Hebrew Bible.\n", "\n", "## 3.1 Similarity\n", "\n", "We have to find passages in the MT that are *similar*.\n", "Therefore we *chunk* the text in some way, and then compute the similarities between pairs of chunks.\n", "\n", "There are many ways to define and compute similarity between texts.\n", "Here, we have tried two methods ``SET`` and ``LCS``.\n", "Both methods define similarity as the fraction of common material with respect to the total material.\n", "\n", "### 3.1.1 SET\n", "\n", "The ``SET`` method reduces textual chunks to *sets* of *lexemes*.\n", "This method abstracts from the order and number of occurrences of words in chunks.\n", "\n", "We use as measure for the similarity of chunks $C_1$ and $C_2$ (taken as sets):\n", "\n", "$$ s_{\\rm set}(C_1, C_2) = {\\vert C_1 \\cap C_2\\vert \\over \\vert C_1 \\cup C_2 \\vert} $$\n", "\n", "where $\\vert X \\vert$ is the number of elements in set $X$.\n", "\n", "### 3.1.2 LCS\n", "\n", "The ``LCS`` method is less reductive: chunks are *strings* of *lexemes*,\n", "so the order and number of occurrences of words is retained.\n", "\n", "We use as measure for the similarity of chunks $C_1$ and $C_2$ (taken as strings):\n", "\n", "$$ s_{\\rm lcs}(C_1, C_2) = {\\vert {\\rm LCS}(C_1,C_2)\\vert \\over \\vert C_1\\vert + \\vert C_2 \\vert -\n", "\\vert {\\rm LCS}(C_1,C_2)\\vert} $$\n", "\n", "where ${\\rm LCS}(C_1, C_2)$ is the\n", "[longest common subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)\n", "of $C_1$ and $C_2$ and\n", "$\\vert X\\vert$ is the length of sequence $X$.\n", "\n", "It remains to be seen whether we need the extra sophistication of ``LCS``.\n", "The risk is that ``LCS`` could fail to spot related passages when there is a large amount of transposition going on.\n", "The results should have the last word.\n", "\n", "We need to compute the LCS efficiently, and for this we used the python ``Levenshtein`` module:\n", "\n", "``pip install python-Levenshtein``\n", "\n", "whose documentation is\n", "[here](http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html).\n", "\n", "## 3.2 Performance\n", "\n", "Similarity computation is the part where the heavy lifting occurs.\n", "It is basically quadratic in the number of chunks, so if you have verses as chunks (~ 23,000),\n", "you need to do ~ 270,000,000 similarity computations, and if you use sentences (~ 64,000),\n", "you need to do ~ 2,000,000,000 ones!\n", "The computation of a single similarity should be *really* fast.\n", "\n", "Besides that, we use two ways to economize:\n", "\n", "* after having computed a matrix for a specific set of parameter values, we save the matrix to disk;\n", " new runs can load the matrix from disk in a matter of seconds;\n", "* we do not store low similarity values in the matrix, low being < ``MATRIX_THRESHOLD``.\n", "\n", "The ``LCS`` method is more complicated.\n", "We have tried the ``ratio`` method from the ``difflib`` package that is present in the standard python distribution.\n", "This is unbearably slow for our purposes.\n", "The ``ratio`` method in the ``Levenshtein`` package is much quicker.\n", "\n", "See the table for an indication of the amount of work to create the similarity matrix\n", "and the performance per similarity method.\n", "\n", "The *matrix threshold* is the lower bound of similarities that are stored in the matrix.\n", "If a pair of chunks 
has a lower similarity, no entry will be made in the matrix.\n", "\n", "The computing has been done on a Macbook Air (11\", mid 2012, 1.7 GHz Intel Core i5, 8GB RAM).\n", "\n", "|chunk type |chunk size|similarity method|matrix threshold|# of comparisons|size of matrix (KB)|computing time (min)|\n", "|:----------|---------:|----------------:|---------------:|---------------:|------------------:|-------------------:|\n", "|fixed |100 |LCS |60 | 9,003,646| 7| ? |\n", "|fixed |100 |SET |50 | 9,003,646| 7| ? |\n", "|fixed |50 |LCS |60 | 36,197,286| 37| ? |\n", "|fixed |50 |SET |50 | 36,197,286| 18| ? |\n", "|fixed |20 |LCS |60 | 227,068,705| 2,400| ? |\n", "|fixed |20 |SET |50 | 227,068,705| 113| ? |\n", "|fixed |10 |LCS |60 | 909,020,841| 59,000| ? |\n", "|fixed |10 |SET |50 | 909,020,841| 1,800| ? |\n", "|object |verse |LCS |60 | 269,410,078| 2,300| 31|\n", "|object |verse |SET |50 | 269,410,078| 509| 14|\n", "|object |half verse|LCS |60 | 1,016,396,241| 40,000| 50|\n", "|object |half verse|SET |50 | 1,016,396,241| 3,600| 41|\n", "|object |sentence |LCS |60 | 2,055,975,750| 212,000| 68|\n", "|object |sentence |SET |50 | 2,055,975,750| 82,000| 63|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Workflow\n", "\n", "## 4.1 Chunking\n", "\n", "There are several ways to chunk the text:\n", "\n", "* fixed chunks of approximately ``CHUNK_SIZE`` words\n", "* by object, such as verse, sentence and even chapter\n", "\n", "After chunking, we prepare the chunks for similarity measuring.\n", "\n", "### 4.1.1 Fixed chunking\n", "Fixed chunking is unnatural, but if the chunk size is small, it can yield fair results.\n", "The results are somewhat difficult to inspect, because they generally do not respect constituent boundaries.\n", "It is to be expected that fixed chunks in variant passages will be mutually *out of phase*,\n", "meaning that the chunks involved in these passages are not aligned with each other.\n", "So they will have a lower similarity than they could have if they were aligned.\n", "This is a source of artificial noise in the outcome and/or missed cases.\n", "\n", "If the chunking respects \"natural\" boundaries in the text, there is far less misalignment.\n", "\n", "### 4.1.2 Object chunking\n", "We can also chunk by object, such as `verse`, `half_verse` or `sentence`.\n", "\n", "Chunking by *verse* is very much like chunking in fixed chunks of size 20, performance-wise.\n", "\n", "Chunking by `half_verse` is comparable to fixed chunks of size 10.\n", "\n", "Chunking by `sentence` will generate an enormous amount of\n", "false positives, because there are very many very short sentences (down to 1-word) in the text.\n", "Besides that, the performance overhead is huge.\n", "\n", "The `half_verses` seem to be a very interesting candidate.\n", "They are smaller than verses, but there are less *degenerate cases* compared to with sentences.\n", "From the table above it can be read that `half_verses` require only half as many similarity computations as sentences.\n", "\n", "\n", "## 4.2 Preparing\n", "\n", "We prepare the chunks for the application of the chosen method of similarity computation (``SET`` or ``LCS``).\n", "\n", "In both cases we reduce the text to a sequence of transliterated consonantal *lexemes* without disambiguation.\n", "In fact, we go one step further: we remove the consonants (aleph, wav, yod) that are often silent.\n", "\n", "For ``SET``, we represent each chunk as the set of its reduced lexemes.\n", "\n", "For ``LCS``, we represent each chunk as the 
string obtained by joining its reduced lexemes separated by white spaces.\n", "\n", "## 4.3 Cliques\n", "\n", "After having computed a sufficient part of the similarity matrix, we set a value for ``SIMILARITY_THRESHOLD``.\n", "All pairs of chunks having at least that similarity are deemed *interesting*.\n", "\n", "We organize the members of such pairs in *cliques*, groups of chunks of which each member is\n", "similar (*similarity* > ``SIMILARITY_THRESHOLD``) to at least one other member.\n", "\n", "We start with no cliques and walk through the pairs whose similarity is above ``SIMILARITY_THRESHOLD``,\n", "and try to put each member into a clique.\n", "\n", "If there is not yet a clique, we make the member in question into a new singleton clique.\n", "\n", "If there are cliques, we find the cliques that have a member similar to the member in question.\n", "If we find several, we merge them all into one clique.\n", "\n", "If there is no such clique, we put the member in a new singleton clique.\n", "\n", "NB: Cliques may *drift*, meaning that they contain members that are completely different from each other.\n", "They are in the same clique, because there is a path of pairwise similar members leading from the one chunk to the other.\n", "\n", "### 4.3.1 Organizing the cliques\n", "In order to handle cases where there are many corresponding verses in corresponding chapters, we produce\n", "chapter-by-chapter diffs in the following way.\n", "\n", "We make a list of all chapters that are involved in cliques.\n", "This yields a list of chapter cliques.\n", "For all *binary* chapters cliques, we generate a colourful diff rendering (as HTML) for the complete two chapters.\n", "\n", "We only do this for *promising* experiments.\n", "\n", "### 4.3.2 Evaluating clique sets\n", "\n", "Not all clique sets are equally worth while.\n", "For example, if we set the ``SIMILARITY_THRESHOLD`` too low, we might get one gigantic clique, especially\n", "in combination with a fine-grained chunking. In other words: we suffer from *clique drifting*.\n", "\n", "We detect clique drifting by looking at the size of the largest clique.\n", "If that is large compared to the total number of chunks, we deem the results unsatisfactory.\n", "\n", "On the other hand, when the ``SIMILARITY_THRESHOLD`` is too high, you might miss a lot of correspondences,\n", "especially when chunks are large, or when we have fixed-size chunks that are out of phase.\n", "\n", "We deem the results of experiments based on a partitioning into fixed length chunks as unsatisfactory, although it\n", "might be interesting to inspect what exactly the damage is.\n", "\n", "At the moment, we have not yet analysed the relative merits of the similarity methods ``SET`` and ``LCS``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Implementation\n", "\n", "\n", "The rest is code. From here we fire up the engines and start computing." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "PICKLE_PROTOCOL = 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting up the context: source file and target directories\n", "\n", "The conversion is executed in an environment of directories, so that sources, temp files and\n", "results are in convenient places and do not have to be shifted around." 
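Before setting up the directories, here is a small toy illustration of the chunk preparation of section 4.2 and the two similarity measures of section 3.1. This is only a sketch: the two chunks are invented lists of lexemes, and the reduction step copies the `EXCLUDED_CONS` pattern and the `<` → `O` replacement that are configured further down in this notebook.

```python
import re

from Levenshtein import ratio  # pip3 install python-Levenshtein

# the same weeding of weak consonants and disambiguation marks as in the preparation step
EXCLUDED_CONS = r"[>WJ=/\[]"
EXCLUDED_PAT = re.compile(EXCLUDED_CONS)


def reduce_lexemes(lexemes):
    # map the ayin character (<) to O, strip the excluded characters,
    # and drop lexemes that become empty
    reduced = (EXCLUDED_PAT.sub("", lx.replace("<", "O")) for lx in lexemes)
    return [lx for lx in reduced if lx != ""]


# two invented, partly overlapping chunks of consonantal lexemes
chunk1 = reduce_lexemes(["BR>[", "CMR[", "DBR/", "MLK/"])
chunk2 = reduce_lexemes(["CMR[", "DBR/", "MLK/", "<BD["])

# SET similarity: size of the intersection over size of the union of the lexeme sets
set1, set2 = set(chunk1), set(chunk2)
s_set = 100 * len(set1 & set2) / len(set1 | set2) if set1 | set2 else 0

# LCS-style similarity: Levenshtein ratio on the space-joined lexeme strings
s_lcs = 100 * ratio(" ".join(chunk1), " ".join(chunk2))

print("SET similarity: {:.0f}".format(s_set))
print("LCS similarity: {:.0f}".format(s_lcs))
```

In this example the SET similarity is 60 (three shared lexemes out of five distinct ones), which would clear a `SIMILARITY_THRESHOLD` of 50 but not one of 65.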
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[5]:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "repoBase = os.path.expanduser(\"~/github/etcbc\")\n", "coreRepo = \"{}/{}\".format(repoBase, CORE_NAME)\n", "thisRepo = \"{}/{}\".format(repoBase, NAME)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "coreTf = \"{}/tf/{}\".format(coreRepo, VERSION)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "allTemp = \"{}/_temp\".format(thisRepo)\n", "thisTemp = \"{}/_temp/{}\".format(thisRepo, VERSION)\n", "thisTempTf = \"{}/tf\".format(thisTemp)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "thisTf = \"{}/tf/{}\".format(thisRepo, VERSION)\n", "thisNotes = \"{}/shebanq/{}\".format(thisRepo, VERSION)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[6]:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "notesFile = \"crossrefNotes.csv\"\n", "if not os.path.exists(thisNotes):\n", " os.makedirs(thisNotes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Test\n", "\n", "Check whether this conversion is needed in the first place.\n", "Only when run as a script." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[7]:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "if SCRIPT:\n", " (good, work) = utils.mustRun(\n", " None, \"{}/.tf/{}.tfx\".format(thisTf, \"crossref\"), force=FORCE\n", " )\n", " if not good:\n", " stop(good=False)\n", " if not work:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.1 Loading the feature data\n", "\n", "We load the features we need from the BHSA core database." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[8]:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
0.00s Load the existing TF dataset .\n", "..............................................................................................\n", "This is Text-Fabric 9.1.7\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "114 features found and 0 ignored\n" ] } ], "source": [ "utils.caption(4, \"Load the existing TF dataset\")\n", "TF = Fabric(locations=coreTf, modules=[\"\"])" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[9]:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " 11s All features loaded/computed - for details use TF.isLoaded()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "api = TF.load(\n", " \"\"\"\n", " otype\n", " {} {} {}\n", " book chapter verse number\n", "\"\"\".format(\n", " LEX_FEATURE,\n", " TEXT_FEATURE,\n", " TRAILER_FEATURE,\n", " )\n", ")\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.2 Configuration\n", "\n", "Here are the parameters on which the results crucially depend.\n", "\n", "There are also parameters that control the reporting of the results, such as file locations." 
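Among these parameters, `SIMILARITY_THRESHOLD` drives the clique growing described in section 4.3. The following toy sketch shows that procedure on a hand-made similarity matrix; the pair similarities are invented, and the real matrix is computed in section 5.6 below.

```python
# walk through all pairs above the threshold, add each passage to a clique
# that already contains a similar member, and merge cliques when a passage
# connects more than one of them
SIMILARITY_THRESHOLD = 65

# invented matrix: (i, j) -> similarity percentage, stored with i < j
chunk_dist = {
    (0, 1): 80,  # chunks 0 and 1 are quite similar
    (1, 2): 70,  # chunk 2 is similar to 1, but not directly to 0
    (3, 4): 90,  # a separate similar pair
    (0, 5): 40,  # below the threshold: ignored
}

similars = {pair for (pair, d) in chunk_dist.items() if d >= SIMILARITY_THRESHOLD}
passages = sorted({i for pair in similars for i in pair})


def sim(i, j):
    # look up a stored similarity in either order; 0 if nothing was stored
    return chunk_dist.get((i, j), chunk_dist.get((j, i), 0))


cliques = []
for p in passages:
    hits = [c for c in cliques if any(sim(p, m) >= SIMILARITY_THRESHOLD for m in c)]
    if not hits:
        cliques.append({p})  # no similar clique yet: start a new singleton
    else:
        hits[0].add(p)  # join the first similar clique
        for other in hits[1:]:  # merge any further similar cliques into it
            hits[0] |= other
            cliques.remove(other)

print(cliques)  # [{0, 1, 2}, {3, 4}]
```

Note that chunks 0 and 2 end up in the same clique although no similarity between them was stored: this is the *clique drift* discussed in section 4.3.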
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[10]:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# chunking\n", "CHUNK_LABELS = {True: \"fixed\", False: \"object\"}\n", "CHUNK_LBS = {True: \"F\", False: \"O\"}\n", "CHUNK_SIZES = (100, 50, 20, 10)\n", "CHUNK_OBJECTS = (\"chapter\", \"verse\", \"half_verse\", \"sentence\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# preparing\n", "EXCLUDED_CONS = r\"[>WJ=/\\[]\" # weed out weak consonants\n", "EXCLUDED_PAT = re.compile(EXCLUDED_CONS)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# similarity\n", "MATRIX_THRESHOLD = 50\n", "SIM_METHODS = (\"SET\", \"LCS\")\n", "SIMILARITIES = (100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# printing\n", "DEP_CLIQUE_RATIO = 25\n", "DUB_CLIQUE_RATIO = 15\n", "REC_CLIQUE_RATIO = 5\n", "LARGE_CLIQUE_SIZE = 50\n", "CLIQUES_PER_FILE = 50" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# assessing results\n", "VALUE_LABELS = dict(\n", " mis=\"no results available\",\n", " rec=\"promising results: recommended\",\n", " dep=\"messy results: deprecated\",\n", " dub=\"mixed quality: take care\",\n", " out=\"method deprecated\",\n", " nor=\"unassessed quality: inspection needed\",\n", " lr=\"this experiment is the last one run\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "note that the `TF_TABLE` and `LOCAL_BASE_COMP` are deliberately\n", "located in the version independent\n", "part of the temporary directory.\n", "Here the results of expensive calculations are stored,\n", "to be used by all versions" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# crossrefs for TF\n", "TF_TABLE = \"{}/parallelTable.tsv\".format(allTemp)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# crossrefs for SHEBANQ\n", "SHEBANQ_MATRIX = (False, \"verse\", \"SET\")\n", "SHEBANQ_SIMILARITY = 65\n", "SHEBANQ_TOOL = \"parallel\"\n", "CROSSREF_STATUS = \"!\"\n", "CROSSREF_KEYWORD = \"crossref\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# progress indication\n", "VERBOSE = False\n", "MEGA = 1000000\n", "KILO = 1000\n", "SIMILARITY_PROGRESS = 5 * MEGA\n", "CLIQUES_PROGRESS = 1 * KILO" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# locations and hyperlinks\n", "LOCAL_BASE_COMP = \"{}/calculus\".format(allTemp)\n", "LOCAL_BASE_OUTP = \"files\"\n", "EXPERIMENT_DIR = \"experiments\"\n", "EXPERIMENT_FILE = \"experiments\"\n", "EXPERIMENT_PATH = \"{}/{}.txt\".format(LOCAL_BASE_OUTP, EXPERIMENT_FILE)\n", "EXPERIMENT_HTML = \"{}/{}.html\".format(LOCAL_BASE_OUTP, EXPERIMENT_FILE)\n", "NOTES_FILE = \"crossref\"\n", "NOTES_PATH = \"{}/{}.csv\".format(LOCAL_BASE_OUTP, NOTES_FILE)\n", "STORED_CLIQUE_DIR = \"stored/cliques\"\n", "STORED_MATRIX_DIR = \"stored/matrices\"\n", "STORED_CHUNK_DIR = \"stored/chunks\"\n", "CHAPTER_DIR = \"chapters\"\n", "CROSSREF_DB_FILE = \"crossrefdb.csv\"\n", "CROSSREF_DB_PATH = \"{}/{}\".format(LOCAL_BASE_OUTP, CROSSREF_DB_FILE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.3 Experiment settings\n", "\n", "For each 
experiment we have to adapt the configuration settings to the parameters that define the experiment." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[11]:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def reset_params():\n", " global CHUNK_FIXED, CHUNK_SIZE, CHUNK_OBJECT, CHUNK_LB, CHUNK_DESC\n", " global SIMILARITY_METHOD, SIMILARITY_THRESHOLD, MATRIX_THRESHOLD\n", " global meta\n", " meta = collections.OrderedDict()\n", "\n", " # chunking\n", " CHUNK_FIXED = None # kind of chunking: fixed size or by object\n", " CHUNK_SIZE = None # only relevant for CHUNK_FIXED = True\n", " CHUNK_OBJECT = (\n", " None # only relevant for CHUNK_FIXED = False; see CHUNK_OBJECTS in next cell\n", " )\n", " CHUNK_LB = None # computed from CHUNK_FIXED, CHUNK_SIZE, CHUNK_OBJ\n", " CHUNK_DESC = None # computed from CHUNK_FIXED, CHUNK_SIZE, CHUNK_OBJ\n", " # similarity\n", " MATRIX_THRESHOLD = (\n", " None # minimal similarity used to fill the matrix of similarities\n", " )\n", " SIMILARITY_METHOD = None # see SIM_METHODS in next cell\n", " SIMILARITY_THRESHOLD = (\n", " None # minimal similarity used to put elements together in cliques\n", " )\n", " meta = collections.OrderedDict()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def set_matrix_threshold(sim_m=None, chunk_o=None):\n", " global MATRIX_THRESHOLD\n", " the_sim_m = SIMILARITY_METHOD if sim_m is None else sim_m\n", " the_chunk_o = CHUNK_OBJECT if chunk_o is None else chunk_o\n", " MATRIX_THRESHOLD = 50 if the_sim_m == \"SET\" else 60\n", " if the_sim_m == \"SET\":\n", " if the_chunk_o == \"chapter\":\n", " MATRIX_THRESHOLD = 30\n", " else:\n", " MATRIX_THRESHOLD = 50\n", " else:\n", " if the_chunk_o == \"chapter\":\n", " MATRIX_THRESHOLD = 55\n", " else:\n", " MATRIX_THRESHOLD = 60" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def do_params_chunk(chunk_f, chunk_i):\n", " global CHUNK_FIXED, CHUNK_SIZE, CHUNK_OBJECT, CHUNK_LB, CHUNK_DESC\n", " do_chunk = False\n", " if (\n", " chunk_f != CHUNK_FIXED\n", " or (chunk_f and chunk_i != CHUNK_SIZE)\n", " or (not chunk_f and chunk_i != CHUNK_OBJECT)\n", " ):\n", " do_chunk = True\n", " CHUNK_FIXED = chunk_f\n", " if chunk_f:\n", " CHUNK_SIZE = chunk_i\n", " else:\n", " CHUNK_OBJECT = chunk_i\n", "\n", " CHUNK_LB = CHUNK_LBS[CHUNK_FIXED]\n", " CHUNK_DESC = CHUNK_SIZE if CHUNK_FIXED else CHUNK_OBJECT\n", "\n", " for p in (\n", " \"{}/{}\".format(LOCAL_BASE_OUTP, EXPERIMENT_DIR),\n", " \"{}/{}\".format(LOCAL_BASE_COMP, STORED_CHUNK_DIR),\n", " ):\n", " if not os.path.exists(p):\n", " os.makedirs(p)\n", "\n", " return do_chunk" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def do_params(chunk_f, chunk_i, sim_m, sim_thr):\n", " global CHUNK_FIXED, CHUNK_SIZE, CHUNK_OBJECT, CHUNK_LB, CHUNK_DESC\n", " global SIMILARITY_METHOD, SIMILARITY_THRESHOLD, MATRIX_THRESHOLD\n", " global meta\n", " do_chunk = False\n", " do_prep = False\n", " do_sim = False\n", " do_clique = False\n", "\n", " meta = collections.OrderedDict()\n", " if (\n", " chunk_f != CHUNK_FIXED\n", " or (chunk_f and chunk_i != CHUNK_SIZE)\n", " or (not chunk_f and chunk_i != CHUNK_OBJECT)\n", " ):\n", " do_chunk = True\n", " do_prep = True\n", " do_sim = True\n", " do_clique = True\n", " CHUNK_FIXED = chunk_f\n", " if chunk_f:\n", " CHUNK_SIZE = chunk_i\n", " else:\n", " CHUNK_OBJECT = chunk_i\n", " if sim_m != 
SIMILARITY_METHOD:\n", " do_prep = True\n", " do_sim = True\n", " do_clique = True\n", " SIMILARITY_METHOD = sim_m\n", " if sim_thr != SIMILARITY_THRESHOLD:\n", " do_clique = True\n", " SIMILARITY_THRESHOLD = sim_thr\n", " set_matrix_threshold()\n", " if SIMILARITY_THRESHOLD < MATRIX_THRESHOLD:\n", " return (False, False, False, False, True)\n", "\n", " CHUNK_LB = CHUNK_LBS[CHUNK_FIXED]\n", " CHUNK_DESC = CHUNK_SIZE if CHUNK_FIXED else CHUNK_OBJECT\n", "\n", " meta[\"CHUNK TYPE\"] = (\n", " \"FIXED {}\".format(CHUNK_SIZE)\n", " if CHUNK_FIXED\n", " else \"OBJECT {}\".format(CHUNK_OBJECT)\n", " )\n", " meta[\"MATRIX THRESHOLD\"] = MATRIX_THRESHOLD\n", " meta[\"SIMILARITY METHOD\"] = SIMILARITY_METHOD\n", " meta[\"SIMILARITY THRESHOLD\"] = SIMILARITY_THRESHOLD\n", "\n", " for p in (\n", " \"{}/{}\".format(LOCAL_BASE_OUTP, EXPERIMENT_DIR),\n", " \"{}/{}\".format(LOCAL_BASE_OUTP, CHAPTER_DIR),\n", " \"{}/{}\".format(LOCAL_BASE_COMP, STORED_CLIQUE_DIR),\n", " \"{}/{}\".format(LOCAL_BASE_COMP, STORED_MATRIX_DIR),\n", " \"{}/{}\".format(LOCAL_BASE_COMP, STORED_CHUNK_DIR),\n", " ):\n", " if not os.path.exists(p):\n", " os.makedirs(p)\n", "\n", " return (do_chunk, do_prep, do_sim, do_clique, False)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "reset_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.4 Chunking\n", "\n", "We divide the text into chunks to be compared. The result is ``chunks``,\n", "which is a list of lists.\n", "Every chunk is a list of word nodes." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[12]:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def chunking(do_chunk):\n", " global chunks, book_rank\n", " if not do_chunk:\n", " TF.info(\n", " \"CHUNKING ({} {}): already chunked into {} chunks\".format(\n", " CHUNK_LB, CHUNK_DESC, len(chunks)\n", " )\n", " )\n", " meta[\"# CHUNKS\"] = len(chunks)\n", " return\n", "\n", " chunk_path = \"{}/{}/chunk_{}_{}\".format(\n", " LOCAL_BASE_COMP,\n", " STORED_CHUNK_DIR,\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " )\n", "\n", " if os.path.exists(chunk_path):\n", " with open(chunk_path, \"rb\") as f:\n", " chunks = pickle.load(f)\n", " TF.info(\n", " \"CHUNKING ({} {}): Loaded: {:>5} chunks\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " len(chunks),\n", " )\n", " )\n", " else:\n", " TF.info(\"CHUNKING ({} {})\".format(CHUNK_LB, CHUNK_DESC))\n", " chunks = []\n", " book_rank = {}\n", " for b in F.otype.s(\"book\"):\n", " book_name = F.book.v(b)\n", " book_rank[book_name] = b\n", " words = L.d(b, otype=\"word\")\n", " nwords = len(words)\n", " if CHUNK_FIXED:\n", " nchunks = nwords // CHUNK_SIZE\n", " if nchunks == 0:\n", " nchunks = 1\n", " common_incr = nwords\n", " special_incr = 0\n", " else:\n", " rem = nwords % CHUNK_SIZE\n", " common_incr = rem // nchunks\n", " special_incr = rem % nchunks\n", " word_in_chunk = -1\n", " cur_chunk = -1\n", " these_chunks = []\n", "\n", " for w in words:\n", " word_in_chunk += 1\n", " if word_in_chunk == 0 or (\n", " word_in_chunk\n", " >= CHUNK_SIZE\n", " + common_incr\n", " + (1 if cur_chunk < special_incr else 0)\n", " ):\n", " word_in_chunk = 0\n", " these_chunks.append([])\n", " cur_chunk += 1\n", " these_chunks[-1].append(w)\n", " else:\n", " these_chunks = [\n", " L.d(c, otype=\"word\") for c in L.d(b, otype=CHUNK_OBJECT)\n", " ]\n", "\n", " chunks.extend(these_chunks)\n", "\n", " 
chunkvolume = sum(len(c) for c in these_chunks)\n", " if VERBOSE:\n", " TF.info(\n", " \"CHUNKING ({} {}): {:<20s} {:>5} words; {:>5} chunks; sizes {:>5} to {:>5}; {:>5}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " book_name,\n", " nwords,\n", " len(these_chunks),\n", " min(len(c) for c in these_chunks),\n", " max(len(c) for c in these_chunks),\n", " \"OK\" if chunkvolume == nwords else \"ERROR\",\n", " )\n", " )\n", " with open(chunk_path, \"wb\") as f:\n", " pickle.dump(chunks, f, protocol=PICKLE_PROTOCOL)\n", " TF.info(\"CHUNKING ({} {}): Made {} chunks\".format(CHUNK_LB, CHUNK_DESC, len(chunks)))\n", " meta[\"# CHUNKS\"] = len(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.5 Preparing\n", "\n", "In order to compute similarities between chunks, we have to compile each chunk into the information that really matters for the comparison. This is dependent on the chosen method of similarity computing.\n", "\n", "### 5.5.1 Preparing for SET comparison\n", "\n", "We reduce words to their lexemes (dictionary entries) and from them we also remove the aleph, wav, and yods.\n", "The lexeme feature also contains characters (`/ [ =`) to disambiguate homonyms. We also remove these.\n", "If we end up with something empty, we skip it.\n", "Eventually, we take the set of these reduced word lexemes, so that we effectively ignore order and multiplicity of words. In other words: the resulting similarity will be based on lexeme content.\n", "\n", "### 5.5.2 Preparing for LCS comparison\n", "\n", "Again, we reduce words to their lexemes as for the SET preparation, and we do the same weeding of consonants and empty strings. But then we concatenate everything, separated by a space. So we preserve order and multiplicity." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[13]:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def preparing(do_prepare):\n", " global chunk_data\n", " if not do_prepare:\n", " TF.info(\n", " \"PREPARING ({} {} {}): Already prepared\".format(\n", " CHUNK_LB, CHUNK_DESC, SIMILARITY_METHOD\n", " )\n", " )\n", " return\n", " TF.info(\"PREPARING ({} {} {})\".format(CHUNK_LB, CHUNK_DESC, SIMILARITY_METHOD))\n", " chunk_data = []\n", " if SIMILARITY_METHOD == \"SET\":\n", " for c in chunks:\n", " words = (\n", " EXCLUDED_PAT.sub(\"\", Fs(LEX_FEATURE).v(w).replace(\"<\", \"O\")) for w in c\n", " )\n", " clean_words = (w for w in words if w != \"\")\n", " this_data = frozenset(clean_words)\n", " chunk_data.append(this_data)\n", " else:\n", " for c in chunks:\n", " words = (\n", " EXCLUDED_PAT.sub(\"\", Fs(LEX_FEATURE).v(w).replace(\"<\", \"O\")) for w in c\n", " )\n", " clean_words = (w for w in words if w != \"\")\n", " this_data = \" \".join(clean_words)\n", " chunk_data.append(this_data)\n", " TF.info(\n", " \"PREPARING ({} {} {}): Done {} chunks.\".format(\n", " CHUNK_LB, CHUNK_DESC, SIMILARITY_METHOD, len(chunk_data)\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.6 Similarity computation\n", "\n", "Here we implement our two ways of similarity computation.\n", "Both need a massive amount of work, especially for experiments with many small chunks.\n", "The similarities are stored in a ``matrix``, a data structure that stores a similarity number for each pair of chunk indexes.\n", "Most pair of chunks will be dissimilar. 
In order to save space, we do not store similarities below a certain threshold.\n", "We store matrices for re-use.\n", "\n", "### 5.6.1 SET similarity\n", "The core is an operation on the sets, associated with the chunks by the prepare step. We take the cardinality of the intersection divided by the cardinality of the union.\n", "Intuitively, we compute the proportion of what two chunks have in common against their total material.\n", "\n", "In case the union is empty (both chunks have yielded an empty set), we deem the chunks not to be interesting as a parallel pair, and we set the similarity to 0.\n", "\n", "### 5.6.2 LCS similarity\n", "The core is the method `ratio()`, taken from the Levenshtein module.\n", "Remember that the preparation step yielded a space separated string of lexemes, and these strings are compared on the basis of edit distance." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[14]:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def similarity_post():\n", " nequals = len({x for x in chunk_dist if chunk_dist[x] >= 100})\n", " cmin = min(chunk_dist.values()) if len(chunk_dist) else \"!empty set!\"\n", " cmax = max(chunk_dist.values()) if len(chunk_dist) else \"!empty set!\"\n", " meta[\"LOWEST AVAILABLE SIMILARITY\"] = cmin\n", " meta[\"HIGHEST AVAILABLE SIMILARITY\"] = cmax\n", " meta[\"# EQUAL COMPARISONS\"] = nequals\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): similarities between {} and {}. {} are 100%\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " cmin,\n", " cmax,\n", " nequals,\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def similarity(do_sim):\n", " global chunk_dist\n", " total_chunks = len(chunks)\n", " total_distances = total_chunks * (total_chunks - 1) // 2\n", " meta[\"# SIMILARITY COMPARISONS\"] = total_distances\n", "\n", " SIMILARITY_PROGRESS = total_distances // 100\n", " if SIMILARITY_PROGRESS >= MEGA:\n", " sim_unit = MEGA\n", " sim_lb = \"M\"\n", " else:\n", " sim_unit = KILO\n", " sim_lb = \"K\"\n", "\n", " if not do_sim:\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Using {:>5} {} ({}) comparisons with {} entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " total_distances // sim_unit,\n", " sim_lb,\n", " total_distances,\n", " len(chunk_dist),\n", " )\n", " )\n", " meta[\"# STORED SIMILARITIES\"] = len(chunk_dist)\n", " similarity_post()\n", " return\n", "\n", " matrix_path = \"{}/{}/matrix_{}_{}_{}_{}\".format(\n", " LOCAL_BASE_COMP,\n", " STORED_MATRIX_DIR,\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " )\n", "\n", " if os.path.exists(matrix_path):\n", " with open(matrix_path, \"rb\") as f:\n", " chunk_dist = pickle.load(f)\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Loaded: {:>5} {} ({}) comparisons with {} entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " total_distances // sim_unit,\n", " sim_lb,\n", " total_distances,\n", " len(chunk_dist),\n", " )\n", " )\n", " meta[\"# STORED SIMILARITIES\"] = len(chunk_dist)\n", " similarity_post()\n", " return\n", "\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Computing {:>5} {} ({}) comparisons and saving entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " 
SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " total_distances // sim_unit,\n", " sim_lb,\n", " total_distances,\n", " )\n", " )\n", "\n", " chunk_dist = {}\n", " wc = 0\n", " wt = 0\n", " if SIMILARITY_METHOD == \"SET\":\n", " # method SET: all chunks have been reduced to sets, ratio between lengths of intersection and union\n", " for i in range(total_chunks):\n", " c_i = chunk_data[i]\n", " for j in range(i + 1, total_chunks):\n", " c_j = chunk_data[j]\n", " u = len(c_i | c_j)\n", "\n", " # HERE COMES THE SIMILARITY COMPUTATION\n", " d = 100 * len(c_i & c_j) / u if u != 0 else 0\n", "\n", " # HERE WE STORE THE OUTCOME\n", " if d >= MATRIX_THRESHOLD:\n", " chunk_dist[(i, j)] = d\n", " wc += 1\n", " wt += 1\n", " if wc == SIMILARITY_PROGRESS:\n", " wc = 0\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Computed {:>5} {} comparisons and saved {} entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " wt // sim_unit,\n", " sim_lb,\n", " len(chunk_dist),\n", " )\n", " )\n", " elif SIMILARITY_METHOD == \"LCS\":\n", " # method LCS: chunks are sequence aligned, ratio between length of all common parts and total length\n", " for i in range(total_chunks):\n", " c_i = chunk_data[i]\n", " for j in range(i + 1, total_chunks):\n", " c_j = chunk_data[j]\n", "\n", " # HERE COMES THE SIMILARITY COMPUTATION\n", " d = 100 * ratio(c_i, c_j)\n", "\n", " # HERE WE STORE THE OUTCOME\n", " if d >= MATRIX_THRESHOLD:\n", " chunk_dist[(i, j)] = d\n", " wc += 1\n", " wt += 1\n", " if wc == SIMILARITY_PROGRESS:\n", " wc = 0\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Computed {:>5} {} comparisons and saved {} entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " wt // sim_unit,\n", " sim_lb,\n", " len(chunk_dist),\n", " )\n", " )\n", "\n", " with open(matrix_path, \"wb\") as f:\n", " pickle.dump(chunk_dist, f, protocol=PICKLE_PROTOCOL)\n", "\n", " TF.info(\n", " \"SIMILARITY ({} {} {} M>{}): Computed {:>5} {} ({}) comparisons and saved {} entries in matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " wt // sim_unit,\n", " sim_lb,\n", " wt,\n", " len(chunk_dist),\n", " )\n", " )\n", "\n", " meta[\"# STORED SIMILARITIES\"] = len(chunk_dist)\n", " similarity_post()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.7 Cliques\n", "\n", "Based on the value for the ``SIMILARITY_THRESHOLD`` we use the similarity matrix to pick the *interesting*\n", "similar pairs out of it. From these pairs we lump together our cliques.\n", "\n", "Our list of experiments will select various values for ``SIMILARITY_THRESHOLD``, which will result\n", "in various types of clique behavior.\n", "\n", "We store computed cliques for re-use.\n", "\n", "## 5.7.1 Selecting passages\n", "\n", "We take all pairs from the similarity matrix which are above the threshold, and add both members to a list of passages.\n", "\n", "## 5.7.2 Growing cliques\n", "We inspect all passages in our set, and try to add them to the cliques we are growing.\n", "We start with an empty set of cliques.\n", "Each passage is added to a clique with which it has *enough familiarity*, otherwise it is added to a new clique.\n", "*Enough familiarity means*: the passage is similar to at least one member of the clique, and the similarity is at least ``SIMILARITY_THRESHOLD``.\n", "It is possible that a passage is thus added to more than one clique. 
In that case, those cliques are merged.\n", "This may lead to growing very large cliques if ``SIMILARITY_THRESHOLD`` is too low." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[15]:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def key_chunk(i):\n", " c = chunks[i]\n", " w = c[0]\n", " return (\n", " -len(c),\n", " L.u(w, otype=\"book\")[0],\n", " L.u(w, otype=\"chapter\")[0],\n", " L.u(w, otype=\"verse\")[0],\n", " )" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def meta_clique_pre():\n", " global similars, passages\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): inspecting the similarity matrix\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " )\n", " similars = {x for x in chunk_dist if chunk_dist[x] >= SIMILARITY_THRESHOLD}\n", " passage_set = set()\n", " for (i, j) in similars:\n", " passage_set.add(i)\n", " passage_set.add(j)\n", " passages = sorted(passage_set, key=key_chunk)\n", "\n", " meta[\"# SIMILAR COMPARISONS\"] = len(similars)\n", " meta[\"# SIMILAR PASSAGES\"] = len(passages)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def meta_clique_pre2():\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): {} relevant similarities between {} passages\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(similars),\n", " len(passages),\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def meta_clique_post():\n", " global l_c_l\n", " meta[\"# CLIQUES\"] = len(cliques)\n", " scliques = collections.Counter()\n", " for c in cliques:\n", " scliques[len(c)] += 1\n", " l_c_l = max(scliques.keys()) if len(scliques) > 0 else 0\n", " totmn = 0\n", " totcn = 0\n", " for (ln, n) in sorted(scliques.items(), key=lambda x: x[0]):\n", " totmn += ln * n\n", " totcn += n\n", " if VERBOSE:\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): {:>4} cliques of length {:>4}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " n,\n", " ln,\n", " )\n", " )\n", " meta[\"# CLIQUES of LENGTH {:>4}\".format(ln)] = n\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): {} members in {} cliques\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " totmn,\n", " totcn,\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def cliqueing(do_clique):\n", " global cliques\n", " if not do_clique:\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): Already loaded {} cliques out of {} candidates from {} comparisons\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(cliques),\n", " len(passages),\n", " len(similars),\n", " )\n", " )\n", " meta_clique_pre2()\n", " meta_clique_post()\n", " return\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): fetching similars and chunk candidates\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " )\n", " meta_clique_pre()\n", " meta_clique_pre2()\n", " clique_path = 
\"{}/{}/clique_{}_{}_{}_{}_{}\".format(\n", " LOCAL_BASE_COMP,\n", " STORED_CLIQUE_DIR,\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " if os.path.exists(clique_path):\n", " with open(clique_path, \"rb\") as f:\n", " cliques = pickle.load(f)\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): Loaded: {:>5} cliques out of {:>6} chunks from {} comparisons\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(cliques),\n", " len(passages),\n", " len(similars),\n", " )\n", " )\n", " meta_clique_post()\n", " return\n", "\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): Composing cliques out of {:>6} chunks from {} comparisons\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(passages),\n", " len(similars),\n", " )\n", " )\n", " cliques_unsorted = []\n", " np = 0\n", " npc = 0\n", " for i in passages:\n", " added = None\n", " removable = set()\n", " for (k, c) in enumerate(cliques_unsorted):\n", " origc = tuple(c)\n", " for j in origc:\n", " d = (\n", " chunk_dist.get((i, j), 0)\n", " if i < j\n", " else chunk_dist.get((j, i), 0)\n", " if j < i\n", " else 0\n", " )\n", " if d >= SIMILARITY_THRESHOLD:\n", " if (\n", " added is None\n", " ): # the passage has not been added to any clique yet\n", " c.add(i)\n", " added = k # remember that we added the passage to this clique\n", " else: # the passage has alreay been added to another clique:\n", " # we merge this clique with that one\n", " cliques_unsorted[added] |= c\n", " removable.add(\n", " k\n", " ) # we remember that we have merged this clicque into another one,\n", " # so we can throw away this clicque later\n", " break\n", " if added is None:\n", " cliques_unsorted.append({i})\n", " else:\n", " if len(removable):\n", " cliques_unsorted = [\n", " c for (k, c) in enumerate(cliques_unsorted) if k not in removable\n", " ]\n", " np += 1\n", " npc += 1\n", " if npc == CLIQUES_PROGRESS:\n", " npc = 0\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): Composed {:>5} cliques out of {:>6} chunks\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(cliques_unsorted),\n", " np,\n", " )\n", " )\n", " cliques = sorted([tuple(sorted(c, key=key_chunk)) for c in cliques_unsorted])\n", " with open(clique_path, \"wb\") as f:\n", " pickle.dump(cliques, f, protocol=PICKLE_PROTOCOL)\n", " meta_clique_post()\n", " TF.info(\n", " \"CLIQUES ({} {} {} M>{} S>{}): Composed and saved {:>5} cliques out of {:>6} chunks from {} comparisons\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(cliques),\n", " len(passages),\n", " len(similars),\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.8 Output\n", "\n", "We deliver the output of our experiments in various ways, all in HTML.\n", "\n", "We generate chapter based diff outputs with color-highlighted differences between the chapters for every pair of chapters that merit it.\n", "\n", "For every (*good*) experiment, we produce a big list of its cliques, and for\n", "every such clique, we produce a diff-view of its members.\n", "\n", "Big cliques will be split into several files.\n", "\n", "Clique listings will also contain metadata: the value of the experiment parameters.\n", "\n", "### 5.8.1 Format 
definitions\n", "Here are the definitions for formatting the (HTML) output." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[16]:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# clique lists\n", "css = \"\"\"\n", "td.vl {\n", " font-family: Verdana, Arial, sans-serif;\n", " font-size: small;\n", " text-align: right;\n", " color: #aaaaaa;\n", " width: 10%;\n", " direction: ltr;\n", " border-left: 2px solid #aaaaaa;\n", " border-right: 2px solid #aaaaaa;\n", "}\n", "td.ht {\n", " font-family: Ezra SIL, SBL Hebrew, Verdana, sans-serif;\n", " font-size: x-large;\n", " line-height: 1.7;\n", " text-align: right;\n", " direction: rtl;\n", "}\n", "table.ht {\n", " width: 100%;\n", " direction: rtl;\n", " border-collapse: collapse;\n", "}\n", "td.ht {\n", " border-left: 2px solid #aaaaaa;\n", " border-right: 2px solid #aaaaaa;\n", "}\n", "tr.ht.tb {\n", " border-top: 2px solid #aaaaaa;\n", " border-left: 2px solid #aaaaaa;\n", " border-right: 2px solid #aaaaaa;\n", "}\n", "tr.ht.bb {\n", " border-bottom: 2px solid #aaaaaa;\n", " border-left: 2px solid #aaaaaa;\n", " border-right: 2px solid #aaaaaa;\n", "}\n", "span.m {\n", " background-color: #aaaaff;\n", "}\n", "span.f {\n", " background-color: #ffaaaa;\n", "}\n", "span.x {\n", " background-color: #ffffaa;\n", " color: #bb0000;\n", "}\n", "span.delete {\n", " background-color: #ffaaaa;\n", "}\n", "span.insert {\n", " background-color: #aaffaa;\n", "}\n", "span.replace {\n", " background-color: #ffff00;\n", "}\n", "\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# chapter diffs\n", "diffhead = \"\"\"\n", "\n", " \n", " \n", " \n", "\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# table of experiments\n", "ecss = \"\"\"\n", "\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "legend = \"\"\"\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
{mis}
{rec}
{dep}
{dub}
{out}
{nor}
\n", "\"\"\".format(\n", " **VALUE_LABELS\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.8.2 Formatting clique lists" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[17]:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def xterse_chunk(i):\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = L.u(fword, otype=\"book\")[0]\n", " chapter = L.u(fword, otype=\"chapter\")[0]\n", " return (book, chapter)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "def xterse_clique(ii):\n", " return tuple(sorted({xterse_chunk(i) for i in ii}))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def terse_chunk(i):\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = L.u(fword, otype=\"book\")[0]\n", " chapter = L.u(fword, otype=\"chapter\")[0]\n", " verse = L.u(fword, otype=\"verse\")[0]\n", " return (book, chapter, verse)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def terse_clique(ii):\n", " return tuple(sorted({terse_chunk(i) for i in ii}))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "def verse_chunk(i):\n", " (bk, ch, vs) = i\n", " book = F.book.v(bk)\n", " chapter = F.chapter.v(ch)\n", " verse = F.verse.v(vs)\n", " text = \"\".join(\n", " \"{}{}\".format(Fs(TEXT_FEATURE).v(w), Fs(TRAILER_FEATURE).v(w))\n", " for w in L.d(vs, otype=\"word\")\n", " )\n", " verse_label = '{} {}:{}'.format(book, chapter, verse)\n", " htext = '{}{}'.format(verse_label, text)\n", " return '{}'.format(htext)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "def verse_clique(ii):\n", " return '{}
\\n'.format(\n", " \"\".join(verse_chunk(i) for i in sorted(ii))\n", " )" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def condense(vlabels):\n", " cnd = \"\"\n", " (cur_b, cur_c) = (None, None)\n", " for (b, c, v) in vlabels:\n", " c = str(c)\n", " v = str(v)\n", " sep = (\n", " \"\"\n", " if cur_b is None\n", " else \". \"\n", " if cur_b != b\n", " else \"; \"\n", " if cur_c != c\n", " else \", \"\n", " )\n", " show_b = b + \" \" if cur_b != b else \"\"\n", " show_c = c + \":\" if cur_b != b or cur_c != c else \"\"\n", " (cur_b, cur_c) = (b, c)\n", " cnd += \"{}{}{}{}\".format(sep, show_b, show_c, v)\n", " return cnd" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def print_diff(a, b):\n", " arep = \"\"\n", " brep = \"\"\n", " for (lb, ai, aj, bi, bj) in SequenceMatcher(\n", " isjunk=None, a=a, b=b, autojunk=False\n", " ).get_opcodes():\n", " if lb == \"equal\":\n", " arep += a[ai:aj]\n", " brep += b[bi:bj]\n", " elif lb == \"delete\":\n", " arep += '{}'.format(lb, a[ai:aj])\n", " elif lb == \"insert\":\n", " brep += '{}'.format(lb, b[bi:bj])\n", " else:\n", " arep += '{}'.format(lb, a[ai:aj])\n", " brep += '{}'.format(lb, b[bi:bj])\n", " return (arep, brep)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def print_chunk_fine(prev, text, verse_labels, prevlabels):\n", " if prev is None:\n", " return \"\"\"\n", "{}{}\n", "\"\"\".format(\n", " condense(verse_labels),\n", " text,\n", " )\n", " else:\n", " (prevline, textline) = print_diff(prev, text)\n", " return \"\"\"\n", "{}{}\n", "{}{}\n", "\"\"\".format(\n", " condense(prevlabels) if prevlabels is not None else \"previous\",\n", " prevline,\n", " condense(verse_labels),\n", " textline,\n", " )" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "def print_chunk_coarse(text, verse_labels):\n", " return \"\"\"\n", "{}{}\n", "\"\"\".format(\n", " condense(verse_labels),\n", " text,\n", " )" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "def print_clique(ii, ncliques):\n", " return (\n", " print_clique_fine(ii)\n", " if len(ii) < ncliques * DEP_CLIQUE_RATIO / 100\n", " else print_clique_coarse(ii)\n", " )" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "def print_clique_fine(ii):\n", " condensed = collections.OrderedDict()\n", " for i in sorted(ii, key=lambda c: (-len(chunks[c]), c)):\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = F.book.v(L.u(fword, otype=\"book\")[0])\n", " chapter = F.chapter.v(L.u(fword, otype=\"chapter\")[0])\n", " verse = F.verse.v(L.u(fword, otype=\"verse\")[0])\n", " text = \"\".join(\n", " \"{}{}\".format(Fs(TEXT_FEATURE).v(w), Fs(TRAILER_FEATURE).v(w))\n", " for w in chunk\n", " )\n", " condensed.setdefault(text, []).append((book, chapter, verse))\n", " result = []\n", " nv = len(condensed.items())\n", " prev = None\n", " for (text, verse_labels) in condensed.items():\n", " if prev is None:\n", " if nv == 1:\n", " result.append(print_chunk_fine(None, text, verse_labels, None))\n", " else:\n", " prev = text\n", " prevlabels = verse_labels\n", " continue\n", " else:\n", " result.append(print_chunk_fine(prev, text, verse_labels, prevlabels))\n", " prev = text\n", " prevlabels = None\n", " return '{}
\\n'.format(\"\".join(result))" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def print_clique_coarse(ii):\n", " condensed = collections.OrderedDict()\n", " for i in sorted(ii, key=lambda c: (-len(chunks[c]), c))[0:LARGE_CLIQUE_SIZE]:\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = F.book.v(L.u(fword, otype=\"book\")[0])\n", " chapter = F.chapter.v(L.u(fword, otype=\"chapter\")[0])\n", " verse = F.verse.v(L.u(fword, otype=\"verse\")[0])\n", " text = \"\".join(\n", " \"{}{}\".format(Fs(TEXT_FEATURE).v(w), Fs(TRAILER_FEATURE).v(w))\n", " for w in chunk\n", " )\n", " condensed.setdefault(text, []).append((book, chapter, verse))\n", " result = []\n", " for (text, verse_labels) in condensed.items():\n", " result.append(print_chunk_coarse(text, verse_labels))\n", " if len(ii) > LARGE_CLIQUE_SIZE:\n", " result.append(\n", " print_chunk_coarse(\"+ {} ...\".format(len(ii) - LARGE_CLIQUE_SIZE), [])\n", " )\n", " return '{}
\\n'.format(\"\".join(result))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "def index_clique(bnm, n, ii, ncliques):\n", " return (\n", " index_clique_fine(bnm, n, ii)\n", " if len(ii) < ncliques * DEP_CLIQUE_RATIO / 100\n", " else index_clique_coarse(bnm, n, ii)\n", " )" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "def index_clique_fine(bnm, n, ii):\n", " verse_labels = []\n", " for i in sorted(ii, key=lambda c: (-len(chunks[c]), c)):\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = F.book.v(L.u(fword, otype=\"book\")[0])\n", " chapter = F.chapter.v(L.u(fword, otype=\"chapter\")[0])\n", " verse = F.verse.v(L.u(fword, otype=\"verse\")[0])\n", " verse_labels.append((book, chapter, verse))\n", " reffl = \"{}_{}\".format(bnm, n // CLIQUES_PER_FILE)\n", " return '

{} {}

'.format(\n", " n,\n", " reffl,\n", " n,\n", " condense(verse_labels),\n", " )" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "def index_clique_coarse(bnm, n, ii):\n", " verse_labels = []\n", " for i in sorted(ii, key=lambda c: (-len(chunks[c]), c))[0:LARGE_CLIQUE_SIZE]:\n", " chunk = chunks[i]\n", " fword = chunk[0]\n", " book = F.book.v(L.u(fword, otype=\"book\")[0])\n", " chapter = F.chapter.v(L.u(fword, otype=\"chapter\")[0])\n", " verse = F.verse.v(L.u(fword, otype=\"verse\")[0])\n", " verse_labels.append((book, chapter, verse))\n", " reffl = \"{}_{}\".format(bnm, n // CLIQUES_PER_FILE)\n", " extra = (\n", " \"+ {} ...\".format(len(ii) - LARGE_CLIQUE_SIZE)\n", " if len(ii) > LARGE_CLIQUE_SIZE\n", " else \"\"\n", " )\n", " return '

{} {}{}

'.format(\n", " n,\n", " reffl,\n", " n,\n", " condense(verse_labels),\n", " extra,\n", " )" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def lines_chapter(c):\n", " lines = []\n", " for v in L.d(c, otype=\"verse\"):\n", " vl = F.verse.v(v)\n", " text = \"\".join(\n", " \"{}{}\".format(Fs(TEXT_FEATURE).v(w), Fs(TRAILER_FEATURE).v(w))\n", " for w in L.d(v, otype=\"word\")\n", " )\n", " lines.append(\"{} {}\".format(vl, text.replace(\"\\n\", \" \")))\n", " return lines" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def compare_chapters(c1, c2, lb1, lb2):\n", " dh = difflib.HtmlDiff(wrapcolumn=80)\n", " table_html = dh.make_table(\n", " lines_chapter(c1),\n", " lines_chapter(c2),\n", " fromdesc=lb1,\n", " todesc=lb2,\n", " context=False,\n", " numlines=5,\n", " )\n", " htext = \"\"\"{}{}\"\"\".format(diffhead, table_html)\n", " return htext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.8.3 Compiling the table of experiments\n", "\n", "Here we generate the table of experiments, complete with the colouring according to their assessments." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[18]:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "# generate the table of experiments\n", "def gen_html(standalone=False):\n", " global other_exps\n", " TF.info(\n", " \"EXPERIMENT: Generating html report{}\".format(\n", " \"(standalone)\" if standalone else \"\"\n", " )\n", " )\n", " stats = collections.Counter()\n", " pre = (\n", " \"\"\"\n", "\n", "\n", "\n", "{}\n", "\n", "\n", "\"\"\".format(\n", " ecss\n", " )\n", " if standalone\n", " else \"\"\n", " )\n", "\n", " post = (\n", " \"\"\"\n", "\n", "\"\"\"\n", " if standalone\n", " else \"\"\n", " )\n", "\n", " experiments = \"\"\"\n", "{}\n", "{}\n", "\n", "{}\n", "\"\"\".format(\n", " pre, legend, \"\".join(\"\".format(sim_thr) for sim_thr in SIMILARITIES)\n", " )\n", "\n", " for chunk_f in (True, False):\n", " if chunk_f:\n", " chunk_items = CHUNK_SIZES\n", " else:\n", " chunk_items = CHUNK_OBJECTS\n", " chunk_lb = CHUNK_LBS[chunk_f]\n", " for chunk_i in chunk_items:\n", " for sim_m in SIM_METHODS:\n", " set_matrix_threshold(sim_m=sim_m, chunk_o=chunk_i)\n", " these_outputs = outputs.get(MATRIX_THRESHOLD, {})\n", " experiments += \"\".format(\n", " CHUNK_LABELS[chunk_f],\n", " chunk_i,\n", " sim_m,\n", " )\n", " for sim_thr in SIMILARITIES:\n", " okey = (chunk_lb, chunk_i, sim_m, sim_thr)\n", " values = these_outputs.get(okey)\n", " if values is None:\n", " result = ''\n", " stats[\"mis\"] += 1\n", " else:\n", " (npassages, ncliques, longest_clique_len) = values\n", " cls = assess_exp(\n", " chunk_f, npassages, ncliques, longest_clique_len\n", " )\n", " stats[cls] += 1\n", " (lr_el, lr_lb) = (\"\", \"\")\n", " if (\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " SIMILARITY_THRESHOLD,\n", " ) == (\n", " chunk_lb,\n", " chunk_i,\n", " sim_m,\n", " sim_thr,\n", " ):\n", " lr_el = '*'\n", " lr_lb = VALUE_LABELS[\"lr\"]\n", " result = \"\"\"\n", "\"\"\".format(\n", " cls,\n", " lr_lb,\n", " lr_el,\n", " npassages,\n", " \"\" if standalone else LOCAL_BASE_OUTP + \"/\",\n", " EXPERIMENT_DIR,\n", " chunk_lb,\n", " chunk_i,\n", " sim_m,\n", " MATRIX_THRESHOLD,\n", " sim_thr,\n", " ncliques,\n", " longest_clique_len,\n", " )\n", " experiments += result\n", " experiments += \"\\n\"\n", " 
experiments += \"
chunk typechunk sizesimilarity method
{}
{}{}{} {}\n", " {}
\n", " {}
\n", " {}\n", "
\\n{}\".format(post)\n", " if standalone:\n", " with open(EXPERIMENT_HTML, \"w\") as f:\n", " f.write(experiments)\n", " else:\n", " other_exps = experiments\n", "\n", " for stat in sorted(stats):\n", " TF.info(\"EXPERIMENT: {:>3} {}\".format(stats[stat], VALUE_LABELS[stat]))\n", " TF.info(\"EXPERIMENT: Generated html report\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.8.4 High level formatting functions\n", "\n", "Here everything concerning output is brought together." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[19]:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "def assess_exp(cf, np, nc, ll):\n", " return (\n", " \"out\"\n", " if cf\n", " else \"rec\"\n", " if ll > nc * REC_CLIQUE_RATIO / 100 and ll <= nc * DUB_CLIQUE_RATIO / 100\n", " else \"dep\"\n", " if ll > nc * DEP_CLIQUE_RATIO / 100\n", " else \"dub\"\n", " if ll > nc * DUB_CLIQUE_RATIO / 100\n", " else \"nor\"\n", " )" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def printing():\n", " global outputs, bin_cliques, base_name\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): sorting out cliques\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " )\n", " xt_cliques = {\n", " xterse_clique(c) for c in cliques\n", " } # chapter cliques as tuples of (b, ch) tuples\n", " bin_cliques = {\n", " c for c in xt_cliques if len(c) == 2\n", " } # chapter cliques with exactly two chapters\n", " # all chapters that occur in binary chapter cliques\n", " meta[\"# BINARY CHAPTER DIFFS\"] = len(bin_cliques)\n", "\n", " # We generate one kind of info for binary chapter cliques (the majority of cases).\n", " # The remaining cases are verse cliques that do not occur in such chapters, e.g. because they\n", " # have member chunks in the same chapter, or in multiple (more than two) chapters.\n", "\n", " ncliques = len(cliques)\n", " chapters_ok = assess_exp(CHUNK_FIXED, len(passages), ncliques, l_c_l) in {\n", " \"rec\",\n", " \"nor\",\n", " \"dub\",\n", " }\n", " cdoing = \"involving\" if chapters_ok else \"skipping\"\n", "\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): formatting {} cliques {} {} binary chapter diffs\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " ncliques,\n", " cdoing,\n", " len(bin_cliques),\n", " )\n", " )\n", " meta_html = \"\\n\".join(\"{:<40} : {:>10}\".format(k, str(meta[k])) for k in meta)\n", "\n", " base_name = \"{}_{}_{}_M{}_S{}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " param_spec = \"\"\"\n", "\n", "\n", "\n", "\n", "\n", "
chunking method{}
chunking description{}
similarity method{}
similarity threshold{}
\n", " \"\"\".format(\n", " CHUNK_LABELS[CHUNK_FIXED],\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " param_lab = \"chunk-{}-{}-sim-{}-m{}-s{}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " )\n", " index_name = base_name\n", " all_name = \"{}_{}\".format(\"all\", base_name)\n", " cliques_name = \"{}_{}\".format(\"clique\", base_name)\n", "\n", " clique_links = []\n", " clique_links.append(\n", " (\"{}/{}.html\".format(base_name, all_name), \"Big list of all cliques\")\n", " )\n", "\n", " nexist = 0\n", " nnew = 0\n", " if chapters_ok:\n", " chapter_diffs = []\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): Chapter diffs needed: {}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(bin_cliques),\n", " )\n", " )\n", "\n", " bcc_text = \"

These results look good, so a binary chapter comparison has been generated

\"\n", " for cl in sorted(bin_cliques):\n", " lb1 = \"{} {}\".format(F.book.v(cl[0][0]), F.chapter.v(cl[0][1]))\n", " lb2 = \"{} {}\".format(F.book.v(cl[1][0]), F.chapter.v(cl[1][1]))\n", " hfilename = \"{}_vs_{}.html\".format(lb1, lb2).replace(\" \", \"_\")\n", " hfilepath = \"{}/{}/{}\".format(LOCAL_BASE_OUTP, CHAPTER_DIR, hfilename)\n", " chapter_diffs.append(\n", " (\n", " lb1,\n", " cl[0][1],\n", " lb2,\n", " cl[1][1],\n", " \"{}/{}/{}/{}\".format(\n", " SHEBANQ_TOOL,\n", " LOCAL_BASE_OUTP,\n", " CHAPTER_DIR,\n", " hfilename,\n", " ),\n", " )\n", " )\n", " if not os.path.exists(hfilepath):\n", " htext = compare_chapters(cl[0][1], cl[1][1], lb1, lb2)\n", " with open(hfilepath, \"w\") as f:\n", " f.write(htext)\n", " if VERBOSE:\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): written {}\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " hfilename,\n", " )\n", " )\n", " nnew += 1\n", " else:\n", " nexist += 1\n", " clique_links.append(\n", " (\n", " \"../{}/{}\".format(CHAPTER_DIR, hfilename),\n", " \"{} versus {}\".format(lb1, lb2),\n", " )\n", " )\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): Chapter diffs: {} newly created and {} already existing\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " nnew,\n", " nexist,\n", " )\n", " )\n", " else:\n", " bcc_text = \"

These results look dubious at best, so no binary chapter comparison has been generated

\"\n", "\n", " allgeni_html = (\n", " index_clique(cliques_name, i, c, ncliques) for (i, c) in enumerate(cliques)\n", " )\n", "\n", " allgen_htmls = []\n", " allgen_html = \"\"\n", "\n", " for (i, c) in enumerate(cliques):\n", " if i % CLIQUES_PER_FILE == 0:\n", " if i > 0:\n", " allgen_htmls.append(allgen_html)\n", " allgen_html = \"\"\n", " allgen_html += '

Clique {}

\\n{}'.format(\n", " i, i, print_clique(c, ncliques)\n", " )\n", " allgen_htmls.append(allgen_html)\n", "\n", " index_html_tpl = \"\"\"\n", "{}\n", "

Binary chapter comparisons

\n", "{}\n", "{}\n", " \"\"\"\n", "\n", " content_file_tpl = \"\"\"\n", "\n", "\n", "{}\n", "\n", "\n", "\n", "

{}

\n", "{}\n", "

more parameters and stats

\n", "{}\n", "

Parameters and stats

\n", "
{}
\n", "\n", "\"\"\"\n", "\n", " a_tpl_file = '

{}

'\n", "\n", " index_html_file = index_html_tpl.format(\n", " a_tpl_file.format(*clique_links[0]),\n", " bcc_text,\n", " \"\\n\".join(a_tpl_file.format(*c) for c in clique_links[1:]),\n", " )\n", "\n", " listing_html = \"{}\\n\".format(\n", " \"\\n\".join(allgeni_html),\n", " )\n", "\n", " for (subdir, fname, content_html, tit) in (\n", " (None, index_name, index_html_file, \"Index \" + param_lab),\n", " (base_name, all_name, listing_html, \"Listing \" + param_lab),\n", " (base_name, cliques_name, allgen_htmls, \"Cliques \" + param_lab),\n", " ):\n", " subdir = \"\" if subdir is None else (subdir + \"/\")\n", " subdirabs = \"{}/{}/{}\".format(LOCAL_BASE_OUTP, EXPERIMENT_DIR, subdir)\n", " if not os.path.exists(subdirabs):\n", " os.makedirs(subdirabs)\n", "\n", " if type(content_html) is list:\n", " for (i, c_h) in enumerate(content_html):\n", " fn = \"{}_{}\".format(fname, i)\n", " t = \"{}_{}\".format(tit, i)\n", " with open(\n", " \"{}/{}/{}{}.html\".format(\n", " LOCAL_BASE_OUTP, EXPERIMENT_DIR, subdir, fn\n", " ),\n", " \"w\",\n", " ) as f:\n", " f.write(\n", " content_file_tpl.format(t, css, t, param_spec, c_h, meta_html)\n", " )\n", " else:\n", " with open(\n", " \"{}/{}/{}{}.html\".format(\n", " LOCAL_BASE_OUTP, EXPERIMENT_DIR, subdir, fname\n", " ),\n", " \"w\",\n", " ) as f:\n", " f.write(\n", " content_file_tpl.format(\n", " tit, css, tit, param_spec, content_html, meta_html\n", " )\n", " )\n", " destination = outputs.setdefault(MATRIX_THRESHOLD, {})\n", " destination[(CHUNK_LB, CHUNK_DESC, SIMILARITY_METHOD, SIMILARITY_THRESHOLD)] = (\n", " len(passages),\n", " len(cliques),\n", " l_c_l,\n", " )\n", " TF.info(\n", " \"PRINT ({} {} {} M>{} S>{}): formatted {} cliques ({} files) {} {} binary chapter diffs\".format(\n", " CHUNK_LB,\n", " CHUNK_DESC,\n", " SIMILARITY_METHOD,\n", " MATRIX_THRESHOLD,\n", " SIMILARITY_THRESHOLD,\n", " len(cliques),\n", " len(allgen_htmls),\n", " cdoing,\n", " len(bin_cliques),\n", " )\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5.9 Running experiments\n", "\n", "The workflows of doing a single experiment, and then all experiments, are defined." 
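, "\n", "\n",
"For orientation, here is a minimal usage sketch of the workflow interface defined in the next cells.\n",
"The parameter values (object-based chunking by verse, the `SET` method, similarity threshold 75) all occur elsewhere in this notebook;\n",
"the calls themselves are an illustration, not a prescribed run.\n",
"\n",
"```python\n",
"# Illustration only: one experiment with object-based chunking by verse,\n",
"# the SET similarity method, similarity threshold 75, and regeneration of\n",
"# the html index of experiments (do_index=True)\n",
"do_experiment(False, \"verse\", \"SET\", 75, True)\n",
"\n",
"# The full grid of experiments (see section 7 for the expected running time):\n",
"# do_all_experiments()\n",
"```"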
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[20]:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "outputs = {}" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "def writeoutputs():\n", " global outputs\n", " with open(EXPERIMENT_PATH, \"wb\") as f:\n", " pickle.dump(outputs, f, protocol=PICKLE_PROTOCOL)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "def readoutputs():\n", " global outputs\n", " if not os.path.exists(EXPERIMENT_PATH):\n", " outputs = {}\n", " else:\n", " with open(EXPERIMENT_PATH, \"rb\") as f:\n", " outputs = pickle.load(f)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "def do_experiment(chunk_f, chunk_i, sim_m, sim_thr, do_index):\n", " if do_index:\n", " readoutputs()\n", " (do_chunk, do_prep, do_sim, do_clique, skip) = do_params(\n", " chunk_f, chunk_i, sim_m, sim_thr\n", " )\n", " if skip:\n", " return\n", " chunking(do_chunk)\n", " preparing(do_prep)\n", " similarity(do_sim)\n", " cliqueing(do_clique)\n", " printing()\n", " if do_index:\n", " writeoutputs()\n", " gen_html()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "def do_only_chunk(chunk_f, chunk_i):\n", " do_chunk = do_params_chunk(chunk_f, chunk_i)\n", " chunking(do_chunk)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "def reset_experiments():\n", " global outputs\n", " readoutputs()\n", " outputs = {}\n", " reset_params()\n", " writeoutputs()\n", " gen_html()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "def do_all_experiments(no_fixed=False, only_object=None):\n", " global outputs\n", " reset_experiments()\n", " for chunk_f in (False,) if no_fixed else (True, False):\n", " if chunk_f:\n", " chunk_items = CHUNK_SIZES\n", " else:\n", " chunk_items = CHUNK_OBJECTS if only_object is None else (only_object,)\n", " for chunk_i in chunk_items:\n", " for sim_m in SIM_METHODS:\n", " for sim_thr in SIMILARITIES:\n", " do_experiment(chunk_f, chunk_i, sim_m, sim_thr, False)\n", " writeoutputs()\n", " gen_html()\n", " gen_html(standalone=True)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "def do_all_chunks(no_fixed=False, only_object=None):\n", " global outputs\n", " reset_experiments()\n", " for chunk_f in (False,) if no_fixed else (True, False):\n", " if chunk_f:\n", " chunk_items = CHUNK_SIZES\n", " else:\n", " chunk_items = CHUNK_OBJECTS if only_object is None else (only_object,)\n", " for chunk_i in chunk_items:\n", " do_only_chunk(chunk_f, chunk_i)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def show_all_experiments():\n", " readoutputs()\n", " gen_html()\n", " gen_html(standalone=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# TF features\n", "\n", "Based on selected similarity matrices, we produce an\n", "edge features between verses, containing weighted links to parallel verses.\n", "\n", "The features to deliver are called `crossrefSET` and `crossrefLCS` and `crossref`.\n", "\n", "These are edge feature, both are symmetric, and hence redundant.\n", "For every node, the *from* and *to* edges are identical.\n", "\n", "The 
`SET` variant is based on set similarity, the `LCS` variant on longest-common-subsequence\n", "similarity.\n", "\n", "The `crossref` feature takes the union of both methods, with the average confidence.\n", "\n", "The weight is the similarity as it comes from the similarity matrix, expressed as an integer percentage.\n", "\n", "## Discussion\n", "We only produce the results of the similarity computation (the matrix); we do not do the cliqueing.\n", "There are many ways to make cliques, and that can easily be done by users of the data, once the\n", "matrix results are in place.\n", "We also do not produce pretty outputs, chapter diffs and other goodies: just the raw similarity data.\n", "\n", "The matrix computation is expensive.\n", "We use fixed settings:\n", "* verse chunks\n", "* `SET` method / `LCS` method\n", "* matrix threshold 50 / 60\n", "* similarity threshold 75\n", "\n", "That is, we compute a matrix that contains all pairs with similarity above 50 or 60,\n", "depending on whether we use the `SET` method or the `LCS` method.\n", "\n", "From that matrix, we only use the similarities above 75.\n", "This gives us room to play without recomputing the matrix.\n", "\n", "We do not want to redo this computation if it can be avoided.\n", "\n", "Verse similarity is not very sensitive to changes in the encoding.\n", "It is very likely that similar verses in one version of the data agree with similar\n", "verses in all other versions.\n", "\n", "However, the node numbers of verses may change from version to version, so that part\n", "must be done again for each version.\n", "\n", "This is how we proceed:\n", "* the matrix computation gives us triples `(v1, v2, d)`, where `v1` and `v2` are verse nodes and `d` is their similarity\n", "* we store the result of the matrix computation in a tab-separated table file (`TF_TABLE`) with the fields\n", " `method, v1, v2, d, v1Ref, v2Ref`, where `v1Ref` and `v2Ref` are verse references,\n", " each consisting of exactly 3 fields: book, chapter, verse\n", "* NB: the similarity table has only one entry for each pair of similar verses per method.\n", " If `(v1, v2)` is in the table, `(v2, v1)` is not in the table, per method.\n", "\n", "When we run this notebook for the pipeline, we check for the presence of this file.\n", "If it is present, we use the `vRefs` in it to compute the verse nodes that are valid for the\n", "version we are going to produce.\n", "That gives us all the data we need, so we can skip the matrix computation.\n", "\n", "If the file is not present, we have to compute the matrix.\n", "There is a parameter, `FORCE_MATRIX`, which can force a recomputation of the matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need some utility functions geared to TF feature production.\n", "The `get_verse()` function is simpler, and we do not have to run full experiments."
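, "\n", "\n",
"To make the table format concrete, here is a small self-contained sketch of how one row of the similarity table becomes a pair of symmetric edge entries.\n",
"The row shown is the Genesis 10:2 / 1_Chronicles 1:5 pair that also appears in the sample output further down;\n",
"the helpers `parseRow` and `addEdge` are invented for this illustration and are not part of the pipeline code below.\n",
"\n",
"```python\n",
"# one tab-separated row: method, v1, v2, sim, book1, chapter1, verse1, book2, chapter2, verse2\n",
"row = \"SET\\t1414625\\t1435841\\t100\\tGenesis\\t10\\t2\\t1_Chronicles\\t1\\t5\"\n",
"\n",
"\n",
"def parseRow(line):\n",
"    # the verse references are not needed for this illustration\n",
"    (method, v1, v2, sim, bk1, ch1, vs1, bk2, ch2, vs2) = line.rstrip(\"\\n\").split(\"\\t\")\n",
"    return (method, int(v1), int(v2), int(sim))\n",
"\n",
"\n",
"def addEdge(edges, method, v1, v2, sim):\n",
"    # a crossref edge is symmetric: register the weight in both directions\n",
"    edges.setdefault(method, {}).setdefault(v1, {})[v2] = sim\n",
"    edges.setdefault(method, {}).setdefault(v2, {})[v1] = sim\n",
"\n",
"\n",
"edges = {}\n",
"addEdge(edges, *parseRow(row))\n",
"print(edges[\"SET\"][1414625][1435841])  # 100\n",
"print(edges[\"SET\"][1435841][1414625])  # 100\n",
"```"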
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[21]:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "def writeSimTable(similars):\n", " with open(TF_TABLE, \"w\") as h:\n", " for entry in similars:\n", " h.write(\"{}\\n\".format(\"\\t\".join(str(x) for x in entry)))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "def readSimTable():\n", " similars = []\n", " stats = set()\n", "\n", " with open(TF_TABLE) as h:\n", " for line in h:\n", " (\n", " method,\n", " v1,\n", " v2,\n", " sim,\n", " book1,\n", " chapter1,\n", " verse1,\n", " book2,\n", " chapter2,\n", " verse2,\n", " ) = line.rstrip(\"\\n\").split(\"\\t\")\n", " verseNode1 = T.nodeFromSection((book1, int(chapter1), int(verse1)))\n", " verseNode2 = T.nodeFromSection((book2, int(chapter2), int(verse2)))\n", " if verseNode1 != int(v1):\n", " stats.add(verseNode1)\n", " if verseNode2 != int(v2):\n", " stats.add(verseNode2)\n", " similars.append(\n", " (\n", " method,\n", " verseNode1,\n", " verseNode2,\n", " int(sim),\n", " book1,\n", " int(chapter1),\n", " int(verse1),\n", " book2,\n", " int(chapter2),\n", " int(verse2),\n", " )\n", " )\n", " nStats = len(stats)\n", " if nStats:\n", " utils.caption(\n", " 0,\n", " \"\\t\\tINFO: {} verse nodes have been changed between versions\".format(\n", " nStats\n", " ),\n", " )\n", " utils.caption(0, \"\\t\\tINFO: We will save and use the recomputed ones\")\n", " writeSimTable(similars)\n", " else:\n", " utils.caption(\n", " 0, \"\\t\\tINFO: All verse nodes are the same as in the previous version\"\n", " )\n", " return similars" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def makeSimTable():\n", " similars = []\n", " for (method, similarityCutoff) in (\n", " (\"SET\", 75),\n", " (\"LCS\", 75),\n", " ):\n", " (do_chunk, do_prep, do_sim, do_clique, skip) = do_params(\n", " False, \"verse\", method, similarityCutoff\n", " )\n", " chunking(do_chunk)\n", " preparing(do_prep)\n", " similarity(do_sim or FORCE_MATRIX)\n", " theseSimilars = []\n", " for ((chunk1, chunk2), sim) in sorted(\n", " (x, d) for (x, d) in chunk_dist.items() if d >= similarityCutoff\n", " ):\n", " verseNode1 = L.u(chunks[chunk1][0], otype=\"verse\")[0]\n", " verseNode2 = L.u(chunks[chunk2][0], otype=\"verse\")[0]\n", " simInt = int(round(sim))\n", " heading1 = T.sectionFromNode(verseNode1)\n", " heading2 = T.sectionFromNode(verseNode2)\n", " theseSimilars.append(\n", " (method, verseNode1, verseNode2, simInt, *heading1, *heading2)\n", " )\n", " utils.caption(\n", " 0,\n", " \"\\tMethod {}: found {} similar pairs of verses\".format(\n", " method, len(theseSimilars)\n", " ),\n", " )\n", " similars.extend(theseSimilars)\n", " writeSimTable(similars)\n", " return similars" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[22]:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
13s CROSSREFS: Fetching crossrefs .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"CROSSREFS: Fetching crossrefs\")" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 13s \tReading existing /Users/werk/github/etcbc/parallels/_temp/parallelTable.tsv\n" ] } ], "source": [ "xTable = os.path.exists(TF_TABLE)\n", "if FORCE_MATRIX:\n", " utils.caption(\n", " 0,\n", " \"\\t{} requested of {}\".format(\n", " \"Recomputing\" if xTable else \"computing\",\n", " TF_TABLE,\n", " ),\n", " )\n", "else:\n", " if xTable:\n", " utils.caption(0, \"\\tReading existing {}\".format(TF_TABLE))\n", " else:\n", " utils.caption(0, \"\\tComputing missing {}\".format(TF_TABLE))" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 13s \t\tINFO: All verse nodes are the same as in the previous version\n" ] } ], "source": [ "if FORCE_MATRIX or not xTable:\n", " similars = makeSimTable()\n", "else:\n", " similars = readSimTable()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[23]:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('LCS', 1414401, 1414407, 84, 'Genesis', 1, 13, 'Genesis', 1, 19)\n", "('LCS', 1414401, 1414411, 89, 'Genesis', 1, 13, 'Genesis', 1, 23)\n", "('LCS', 1414403, 1414405, 77, 'Genesis', 1, 15, 'Genesis', 1, 17)\n", "('LCS', 1414407, 1414411, 84, 'Genesis', 1, 19, 'Genesis', 1, 23)\n", "('LCS', 1414498, 1414501, 79, 'Genesis', 5, 4, 'Genesis', 5, 7)\n", "('LCS', 1414498, 1414507, 75, 'Genesis', 5, 4, 'Genesis', 5, 13)\n", "('LCS', 1414498, 1414510, 78, 'Genesis', 5, 4, 'Genesis', 5, 16)\n", "('LCS', 1414498, 1414513, 86, 'Genesis', 5, 4, 'Genesis', 5, 19)\n", "('LCS', 1414498, 1414524, 77, 'Genesis', 5, 4, 'Genesis', 5, 30)\n", "('LCS', 1414498, 1414666, 79, 'Genesis', 5, 4, 'Genesis', 11, 11)\n", "('SET', 1414505, 1414623, 80, 'Genesis', 5, 11, 'Genesis', 9, 29)\n", "('SET', 1414510, 1414513, 77, 'Genesis', 5, 16, 'Genesis', 5, 19)\n", "('SET', 1414625, 1435841, 100, 'Genesis', 10, 2, '1_Chronicles', 1, 5)\n", "('SET', 1414629, 1435844, 100, 'Genesis', 10, 6, '1_Chronicles', 1, 8)\n", "('SET', 1414630, 1435845, 100, 'Genesis', 10, 7, '1_Chronicles', 1, 9)\n", "('SET', 1414631, 1435846, 100, 'Genesis', 10, 8, '1_Chronicles', 1, 10)\n", "('SET', 1414636, 1435847, 100, 'Genesis', 10, 13, '1_Chronicles', 1, 11)\n", "('SET', 1414637, 1435848, 100, 'Genesis', 10, 14, '1_Chronicles', 1, 12)\n", "('SET', 1414638, 1435849, 100, 'Genesis', 10, 15, '1_Chronicles', 1, 13)\n", "('SET', 1414639, 1414770, 83, 'Genesis', 10, 16, 'Genesis', 15, 21)\n" ] } ], "source": [ "if not SCRIPT:\n", " print(\"\\n\".join(sorted(repr(sim) for sim in similars if sim[0] == \"LCS\")[0:10]))\n", " print(\"\\n\".join(sorted(repr(sim) for sim in similars if sim[0] == \"SET\")[0:10]))" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "crossrefData = {}\n", "otherMethod = dict(LCS=\"SET\", SET=\"LCS\")" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "for (method, v1, v2, sim, *x) in similars:\n", " crossrefData.setdefault(method, {}).setdefault(v1, {})[v2] = sim\n", " 
crossrefData.setdefault(method, {}).setdefault(v2, {})[v1] = sim\n", " omethod = otherMethod[method]\n", " otherSim = crossrefData.get(omethod, {}).get(v1, {}).get(v2, None)\n", " thisSim = sim if otherSim is None else int(round((otherSim + sim) / 2))\n", " crossrefData.setdefault(\"\", {}).setdefault(v1, {})[v2] = thisSim\n", " crossrefData.setdefault(\"\", {}).setdefault(v2, {})[v1] = thisSim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating parallels module for Text-Fabric\n", "\n", "We generate the feature `crossref`.\n", "It is an edge feature between verse nodes, with the similarity as weight." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 6m 16s Writing TF parallel features .\n", "..............................................................................................\n" ] } ], "source": [ "utils.caption(4, \"Writing TF parallel features\")" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "newFeatureStr = \"crossref crossrefSET crossrefLCS\"\n", "newFeatures = newFeatureStr.strip().split()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "genericMetaPath = f\"{thisRepo}/yaml/generic.yaml\"\n", "parallelsMetaPath = f\"{thisRepo}/yaml/parallels.yaml\"\n", "\n", "with open(genericMetaPath) as fh:\n", " genericMeta = yaml.load(fh, Loader=yaml.FullLoader)\n", " genericMeta[\"version\"] = VERSION\n", "with open(parallelsMetaPath) as fh:\n", " parallelsMeta = formatMeta(yaml.load(fh, Loader=yaml.FullLoader))\n", "\n", "metaData = {\"\": genericMeta, **parallelsMeta}" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "nodeFeatures = dict()\n", "edgeFeatures = dict()\n", "for method in [\"\"] + list(otherMethod):\n", " edgeFeatures[\"crossref{}\".format(method)] = crossrefData[method]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "for newFeature in newFeatures:\n", " metaData[newFeature][\"valueType\"] = \"int\"\n", " metaData[newFeature][\"edgeValues\"] = True" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TF = Fabric(locations=thisTempTf, silent=True)\n", "TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generating simple `crossref` notes for SHEBANQ\n", "We base them on the average of both methods, we supply the confidence." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[33]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MAX_REFS = 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def condenseX(vlabels):\n", " cnd = []\n", " (cur_b, cur_c) = (None, None)\n", " for (b, c, v, d) in vlabels:\n", " sep = (\n", " \"\"\n", " if cur_b is None\n", " else \". 
\"\n", " if cur_b != b\n", " else \"; \"\n", " if cur_c != c\n", " else \", \"\n", " )\n", " show_b = b + \" \" if cur_b != b else \"\"\n", " show_c = str(c) + \":\" if cur_b != b or cur_c != c else \"\"\n", " (cur_b, cur_c) = (b, c)\n", " cnd.append(\"{}[{}{}{}{}]\".format(sep, show_b, show_c, v, d))\n", " return cnd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crossrefBase = crossrefData[\"\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "refsGrouped = []\n", "nCrossrefs = 0\n", "for (x, refs) in crossrefBase.items():\n", " vys = sorted(refs.keys())\n", " nCrossrefs += len(vys)\n", " currefs = []\n", " for vy in vys:\n", " nr = len(currefs)\n", " if nr == MAX_REFS:\n", " refsGrouped.append((x, tuple(currefs)))\n", " currefs = []\n", " currefs.append(vy)\n", " if len(currefs):\n", " refsGrouped.append((x, tuple(currefs)))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 8m 58s Compiled 31742 cross references into 6215 notes\n" ] } ], "source": [ "refsCompiled = []\n", "for (x, vys) in refsGrouped:\n", " vysd = [\n", " (*T.sectionFromNode(vy, lang=\"la\"), \" ~{}%\".format(crossrefBase[x][vy]))\n", " for vy in vys\n", " ]\n", " vysl = condenseX(vysd)\n", " these_refs = []\n", " for (i, vy) in enumerate(vysd):\n", " link_text = vysl[i]\n", " link_target = \"{} {}:{}\".format(vy[0], vy[1], vy[2])\n", " these_refs.append(\"{}({})\".format(link_text, link_target))\n", " refsCompiled.append((x, \" \".join(these_refs)))\n", "utils.caption(\n", " 0,\n", " \"Compiled {} cross references into {} notes\".format(nCrossrefs, len(refsCompiled)),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[34]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sfields = \"\"\"\n", " version\n", " book\n", " chapter\n", " verse\n", " clause_atom\n", " is_shared\n", " is_published\n", " status\n", " keywords\n", " ntext\n", "\"\"\".strip().split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sfields_fmt = (\"{}\\t\" * (len(sfields) - 1)) + \"{}\\n\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ofs = open(\"{}/{}\".format(thisNotes, notesFile), \"w\")\n", "ofs.write(\"{}\\n\".format(\"\\t\".join(sfields)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for (v, refs) in refsCompiled:\n", " firstWord = L.d(v, otype=\"word\")[0]\n", " ca = F.number.v(L.u(firstWord, otype=\"clause_atom\")[0])\n", " (bk, ch, vs) = T.sectionFromNode(v, lang=\"la\")\n", " ofs.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk,\n", " ch,\n", " vs,\n", " ca,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " refs,\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| 8m 58s Generated 6215 notes\n" ] } ], "source": [ "utils.caption(0, \"Generated {} notes\".format(len(refsCompiled)))\n", "ofs.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Diffs\n", "\n", "Check differences with previous versions." 
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[35]:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 9m 05s Check differences with previous version .\n", "..............................................................................................\n", "| 9m 05s \t3 features to add\n", "| 9m 05s \t\tcrossref\n", "| 9m 05s \t\tcrossrefLCS\n", "| 9m 05s \t\tcrossrefSET\n", "| 9m 05s \tno features to delete\n", "| 9m 05s \t0 features in common\n", "| 9m 05s Done\n" ] } ], "source": [ "utils.checkDiffs(thisTempTf, thisTf, only=set(newFeatures))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deliver\n", "\n", "Copy the new TF feature from the temporary location where it has been created to its final destination." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[36]:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 9m 19s Deliver data set to /Users/dirk/github/etcbc/parallels/tf/2021 .\n", "..............................................................................................\n" ] } ], "source": [ "utils.deliverDataset(thisTempTf, thisTf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compile TF" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[38]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "utils.caption(4, \"Load and compile the new TF features\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
10m 25s Load and compile the new TF features .\n", "..............................................................................................\n", "This is Text-Fabric 8.5.13\n", "Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html\n", "\n", "117 features found and 0 ignored\n", " 0.00s loading features ...\n", " | 0.00s Dataset without structure sections in otext:no structure functions in the T-API\n", " 3.47s All features loaded/computed - for details use loadLog()\n" ] }, { "data": { "text/plain": [ "[('Computed',\n", " 'computed-data',\n", " ('C Computed', 'Call AllComputeds', 'Cs ComputedString')),\n", " ('Features', 'edge-features', ('E Edge', 'Eall AllEdges', 'Es EdgeString')),\n", " ('Fabric', 'loading', ('TF',)),\n", " ('Locality', 'locality', ('L Locality',)),\n", " ('Nodes', 'navigating-nodes', ('N Nodes',)),\n", " ('Features',\n", " 'node-features',\n", " ('F Feature', 'Fall AllFeatures', 'Fs FeatureString')),\n", " ('Search', 'search', ('S Search',)),\n", " ('Text', 'text', ('T Text',))]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TF = Fabric(locations=[coreTf, thisTf], modules=[\"\"])\n", "api = TF.load(newFeatureStr)\n", "api.makeAvailableIn(globals())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We list all the `crossrefs` that the verses of Genesis 10 are involved in." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[39]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "utils.caption(4, \"Test: crossrefs of Genesis 10\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chapter = (\"Genesis\", 10)\n", "chapterNode = T.nodeFromSection(chapter)\n", "startVerses = {}" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "..............................................................................................\n", ". 
10m 33s Test: crossrefs of Genesis 10 .\n", "..............................................................................................\n", "| 10m 33s \tMethod \n", "| 10m 33s \t\t20 start verses\n", "\t\tGenesis 10:2\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:5 confidende 100%\n", "\t\tGenesis 10:3\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:6 confidende 95%\n", "\t\tGenesis 10:4\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:7 confidende 95%\n", "\t\tGenesis 10:6\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:8 confidende 100%\n", "\t\tGenesis 10:7\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:9 confidende 100%\n", "\t\tGenesis 10:8\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:10 confidende 100%\n", "\t\tGenesis 10:13\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:11 confidende 100%\n", "\t\tGenesis 10:14\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:12 confidende 100%\n", "\t\tGenesis 10:15\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:13 confidende 100%\n", "\t\tGenesis 10:16\n", "| 10m 33s \t\t ----------> Genesis 15:21 confidende 83%\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:14 confidende 100%\n", "\t\tGenesis 10:17\n", "| 10m 33s \t\t ----------> Genesis 15:20 confidende 76%\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:15 confidende 100%\n", "\t\tGenesis 10:20\n", "| 10m 33s \t\t ----------> Genesis 10:31 confidende 87%\n", "\t\tGenesis 10:22\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:17 confidende 77%\n", "\t\tGenesis 10:24\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:18 confidende 100%\n", "\t\tGenesis 10:25\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:19 confidende 100%\n", "\t\tGenesis 10:26\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:20 confidende 100%\n", "\t\tGenesis 10:27\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:21 confidende 100%\n", "| 10m 33s \t\t ----------> 2_Chronicles 11:9 confidende 78%\n", "\t\tGenesis 10:28\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:22 confidende 100%\n", "\t\tGenesis 10:29\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:23 confidende 100%\n", "\t\tGenesis 10:31\n", "| 10m 33s \t\t ----------> Genesis 10:20 confidende 87%\n", "| 10m 33s \tMethod SET\n", "| 10m 33s \t\t20 start verses\n", "\t\tGenesis 10:2\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:5 confidende 100%\n", "\t\tGenesis 10:3\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:6 confidende 95%\n", "\t\tGenesis 10:4\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:7 confidende 95%\n", "\t\tGenesis 10:6\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:8 confidende 100%\n", "\t\tGenesis 10:7\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:9 confidende 100%\n", "\t\tGenesis 10:8\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:10 confidende 100%\n", "\t\tGenesis 10:13\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:11 confidende 100%\n", "\t\tGenesis 10:14\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:12 confidende 100%\n", "\t\tGenesis 10:15\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:13 confidende 100%\n", "\t\tGenesis 10:16\n", "| 10m 33s \t\t ----------> Genesis 15:21 confidende 83%\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:14 confidende 100%\n", "\t\tGenesis 10:17\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:15 confidende 100%\n", "\t\tGenesis 10:20\n", "| 10m 33s \t\t ----------> Genesis 10:31 confidende 80%\n", "\t\tGenesis 10:22\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:17 confidende 77%\n", "\t\tGenesis 10:24\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:18 confidende 100%\n", "\t\tGenesis 10:25\n", "| 10m 33s \t\t ----------> 
1_Chronicles 1:19 confidende 100%\n", "\t\tGenesis 10:26\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:20 confidende 100%\n", "\t\tGenesis 10:27\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:21 confidende 100%\n", "\t\tGenesis 10:28\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:22 confidende 100%\n", "\t\tGenesis 10:29\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:23 confidende 100%\n", "\t\tGenesis 10:31\n", "| 10m 33s \t\t ----------> Genesis 10:20 confidende 80%\n", "| 10m 33s \tMethod LCS\n", "| 10m 33s \t\t20 start verses\n", "\t\tGenesis 10:2\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:5 confidende 100%\n", "\t\tGenesis 10:3\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:6 confidende 95%\n", "\t\tGenesis 10:4\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:7 confidende 95%\n", "\t\tGenesis 10:6\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:8 confidende 100%\n", "\t\tGenesis 10:7\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:9 confidende 100%\n", "\t\tGenesis 10:8\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:10 confidende 100%\n", "\t\tGenesis 10:13\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:11 confidende 100%\n", "\t\tGenesis 10:14\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:12 confidende 100%\n", "\t\tGenesis 10:15\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:13 confidende 100%\n", "\t\tGenesis 10:16\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:14 confidende 100%\n", "\t\tGenesis 10:17\n", "| 10m 33s \t\t ----------> Genesis 15:20 confidende 76%\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:15 confidende 100%\n", "\t\tGenesis 10:20\n", "| 10m 33s \t\t ----------> Genesis 10:31 confidende 94%\n", "\t\tGenesis 10:22\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:17 confidende 77%\n", "\t\tGenesis 10:24\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:18 confidende 100%\n", "\t\tGenesis 10:25\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:19 confidende 100%\n", "\t\tGenesis 10:26\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:20 confidende 100%\n", "\t\tGenesis 10:27\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:21 confidende 100%\n", "| 10m 33s \t\t ----------> 2_Chronicles 11:9 confidende 78%\n", "\t\tGenesis 10:28\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:22 confidende 100%\n", "\t\tGenesis 10:29\n", "| 10m 33s \t\t ----------> 1_Chronicles 1:23 confidende 100%\n", "\t\tGenesis 10:31\n", "| 10m 33s \t\t ----------> Genesis 10:20 confidende 94%\n" ] } ], "source": [ "for method in [\"\", \"SET\", \"LCS\"]:\n", " utils.caption(0, \"\\tMethod {}\".format(method))\n", " for verseNode in L.d(chapterNode, otype=\"verse\"):\n", " crossrefs = Es(\"crossref{}\".format(method)).f(verseNode)\n", " if crossrefs:\n", " startVerses[T.sectionFromNode(verseNode)] = crossrefs\n", " utils.caption(0, \"\\t\\t{} start verses\".format(len(startVerses)))\n", " for (start, crossrefs) in sorted(startVerses.items()):\n", " utils.caption(0, \"\\t\\t{} {}:{}\".format(*start), continuation=True)\n", " for (target, confidence) in crossrefs:\n", " utils.caption(\n", " 0,\n", " \"\\t\\t{:>20} {:<20} confidende {:>3}%\".format(\n", " \"-\" * 10 + \">\",\n", " \"{} {}:{}\".format(*T.sectionFromNode(target)),\n", " confidence,\n", " ),\n", " )" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[29]:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "if SCRIPT:\n", " stop(good=True)" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "# 6b. SHEBANQ annotations\n", "\n", "The code below generates extensive `crossref` notes for `4b`, including clique overviews and chapter diffs.\n", "But since the pipeline in October 2017, we generate much simpler notes.\n", "That code is above.\n", "\n", "We retain this code here, in case we want to expand the `crossref` functionality in the future again.\n", "\n", "Based on selected similarity matrices, we produce a SHEBANQ note set of cross references for similar passages." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[30]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_verse(i, ca=False):\n", " return get_verse_w(chunks[i][0], ca=ca)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_verse_o(o, ca=False):\n", " return get_verse_w(L.d(o, otype=\"word\")[0], ca=ca)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_verse_w(w, ca=False):\n", " book = F.book.v(L.u(w, otype=\"book\")[0])\n", " chapter = F.chapter.v(L.u(w, otype=\"chapter\")[0])\n", " verse = F.verse.v(L.u(w, otype=\"verse\")[0])\n", " if ca:\n", " ca = F.number.v(L.u(w, otype=\"clause_atom\")[0])\n", " return (book, chapter, verse, ca) if ca else (book, chapter, verse)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def key_verse(x):\n", " return (book_rank[x[0]], int(x[1]), int(x[2]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MAX_REFS = 10" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def condensex(vlabels):\n", " cnd = []\n", " (cur_b, cur_c) = (None, None)\n", " for (b, c, v, d) in vlabels:\n", " sep = (\n", " \"\"\n", " if cur_b is None\n", " else \". 
\"\n", " if cur_b != b\n", " else \"; \"\n", " if cur_c != c\n", " else \", \"\n", " )\n", " show_b = b + \" \" if cur_b != b else \"\"\n", " show_c = c + \":\" if cur_b != b or cur_c != c else \"\"\n", " (cur_b, cur_c) = (b, c)\n", " cnd.append(\"{}{}{}{}{}\".format(sep, show_b, show_c, v, d))\n", " return cnd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfields = \"\"\"\n", " book1\n", " chapter1\n", " verse1\n", " book2\n", " chapter2\n", " verse2\n", " similarity\n", "\"\"\".strip().split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfields_fmt = (\"{}\\t\" * (len(dfields) - 1)) + \"{}\\n\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_crossrefs():\n", " global crossrefs\n", " TF.info(\"CROSSREFS: Fetching crossrefs\")\n", " crossrefs_proto = {}\n", " crossrefs = {}\n", " (chunk_f, chunk_i, sim_m) = SHEBANQ_MATRIX\n", " sim_thr = SHEBANQ_SIMILARITY\n", " (do_chunk, do_prep, do_sim, do_clique, skip) = do_params(\n", " chunk_f, chunk_i, sim_m, sim_thr\n", " )\n", " if skip:\n", " return\n", " TF.info(\n", " \"CROSSREFS ({} {} {} S>{})\".format(CHUNK_LBS[chunk_f], chunk_i, sim_m, sim_thr)\n", " )\n", " crossrefs_proto = {x for x in chunk_dist.items() if x[1] >= sim_thr}\n", " TF.info(\n", " \"CROSSREFS ({} {} {} S>{}): found {} pairs\".format(\n", " CHUNK_LBS[chunk_f],\n", " chunk_i,\n", " sim_m,\n", " sim_thr,\n", " len(crossrefs_proto),\n", " )\n", " )\n", " f = open(CROSSREF_DB_PATH, \"w\")\n", " f.write(\"{}\\n\".format(\"\\t\".join(dfields)))\n", " for ((x, y), d) in crossrefs_proto:\n", " vx = get_verse(x)\n", " vy = get_verse(y)\n", " rd = int(round(d))\n", " crossrefs.setdefault(x, {})[vy] = rd\n", " crossrefs.setdefault(y, {})[vx] = rd\n", " f.write(dfields_fmt.format(*(vx + vy + (rd,))))\n", " total = sum(len(x) for x in crossrefs.values())\n", " f.close()\n", " TF.info(\n", " \"CROSSREFS: Found {} crossreferences and wrote {} pairs\".format(\n", " total, len(crossrefs_proto)\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_specific_crossrefs(chunk_f, chunk_i, sim_m, sim_thr, write_to):\n", " (do_chunk, do_prep, do_sim, do_clique, skip) = do_params(\n", " chunk_f, chunk_i, sim_m, sim_thr\n", " )\n", " if skip:\n", " return\n", " chunking(do_chunk)\n", " preparing(do_prep)\n", " similarity(do_sim)\n", "\n", " TF.info(\"CROSSREFS: Fetching crossrefs\")\n", " crossrefs_proto = {}\n", " crossrefs = {}\n", " (do_chunk, do_prep, do_sim, do_clique, skip) = do_params(\n", " chunk_f, chunk_i, sim_m, sim_thr\n", " )\n", " if skip:\n", " return\n", " TF.info(\n", " \"CROSSREFS ({} {} {} S>{})\".format(CHUNK_LBS[chunk_f], chunk_i, sim_m, sim_thr)\n", " )\n", " crossrefs_proto = {x for x in chunk_dist.items() if x[1] >= sim_thr}\n", " TF.info(\n", " \"CROSSREFS ({} {} {} S>{}): found {} pairs\".format(\n", " CHUNK_LBS[chunk_f],\n", " chunk_i,\n", " sim_m,\n", " sim_thr,\n", " len(crossrefs_proto),\n", " )\n", " )\n", " f = open(\"files/{}\".format(write_to), \"w\")\n", " f.write(\"{}\\n\".format(\"\\t\".join(dfields)))\n", " for ((x, y), d) in crossrefs_proto:\n", " vx = get_verse(x)\n", " vy = get_verse(y)\n", " rd = int(round(d))\n", " crossrefs.setdefault(x, {})[vy] = rd\n", " crossrefs.setdefault(y, {})[vx] = rd\n", " f.write(dfields_fmt.format(*(vx + vy + (rd,))))\n", " total = sum(len(x) for x in crossrefs.values())\n", " f.close()\n", " 
TF.info(\n", " \"CROSSREFS: Found {} crossreferences and wrote {} pairs\".format(\n", " total, len(crossrefs_proto)\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compile_refs():\n", " global refs_compiled\n", " refs_grouped = []\n", " for x in sorted(crossrefs):\n", " refs = crossrefs[x]\n", " vys = sorted(refs.keys(), key=key_verse)\n", " currefs = []\n", " for vy in vys:\n", " nr = len(currefs)\n", " if nr == MAX_REFS:\n", " refs_grouped.append((x, tuple(currefs)))\n", " currefs = []\n", " currefs.append(vy)\n", " if len(currefs):\n", " refs_grouped.append((x, tuple(currefs)))\n", " refs_compiled = []\n", " for (x, vys) in refs_grouped:\n", " vysd = [(vy[0], vy[1], vy[2], \" ~{}%\".format(crossrefs[x][vy])) for vy in vys]\n", " vysl = condensex(vysd)\n", " these_refs = []\n", " for (i, vy) in enumerate(vysd):\n", " link_text = vysl[i]\n", " link_target = \"{} {}:{}\".format(vy[0], vy[1], vy[2])\n", " these_refs.append(\"[{}]({})\".format(link_text, link_target))\n", " refs_compiled.append((x, \" \".join(these_refs)))\n", " TF.info(\n", " \"CROSSREFS: Compiled cross references into {} notes\".format(len(refs_compiled))\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_chapter_diffs():\n", " global chapter_diffs\n", " chapter_diffs = []\n", " for cl in sorted(bin_cliques):\n", " lb1 = \"{} {}\".format(F.book.v(cl[0][0]), F.chapter.v(cl[0][1]))\n", " lb2 = \"{} {}\".format(F.book.v(cl[1][0]), F.chapter.v(cl[1][1]))\n", " hfilename = \"{}_vs_{}.html\".format(lb1, lb2).replace(\" \", \"_\")\n", " chapter_diffs.append(\n", " (\n", " lb1,\n", " cl[0][1],\n", " lb2,\n", " cl[1][1],\n", " \"{}/{}/{}/{}\".format(\n", " SHEBANQ_TOOL,\n", " LOCAL_BASE_OUTP,\n", " CHAPTER_DIR,\n", " hfilename,\n", " ),\n", " )\n", " )\n", " TF.info(\"CROSSREFS: Added {} chapter diffs\".format(2 * len(chapter_diffs)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_clique_refs():\n", " global clique_refs\n", " clique_refs = []\n", " for (i, c) in enumerate(cliques):\n", " for j in c:\n", " seq = i // CLIQUES_PER_FILE\n", " clique_refs.append(\n", " (\n", " j,\n", " i,\n", " \"{}/{}/{}/{}/clique_{}_{}.html#c_{}\".format(\n", " SHEBANQ_TOOL,\n", " LOCAL_BASE_OUTP,\n", " EXPERIMENT_DIR,\n", " base_name,\n", " base_name,\n", " seq,\n", " i,\n", " ),\n", " )\n", " )\n", " TF.info(\"CROSSREFS: Added {} clique references\".format(len(clique_refs)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sfields = \"\"\"\n", " version\n", " book\n", " chapter\n", " verse\n", " clause_atom\n", " is_shared\n", " is_published\n", " status\n", " keywords\n", " ntext\n", "\"\"\".strip().split()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sfields_fmt = (\"{}\\t\" * (len(sfields) - 1)) + \"{}\\n\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_notes():\n", " with open(NOTES_PATH, \"w\") as f:\n", " f.write(\"{}\\n\".format(\"\\t\".join(sfields)))\n", " x = next(F.otype.s(\"word\"))\n", " (bk, ch, vs, ca) = get_verse(x, ca=True)\n", " f.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk,\n", " ch,\n", " vs,\n", " ca,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " \"\"\"The crossref notes are the result of a computation without manual tweaks.\n", 
"Parameters: chunk by verse, similarity method SET with threshold 65.\n", "[Here](tool=parallel) is an account of the generation method.\"\"\".replace(\n", " \"\\n\", \" \"\n", " ),\n", " )\n", " )\n", " for (lb1, ch1, lb2, ch2, fl) in chapter_diffs:\n", " (bk1, ch1, vs1, ca1) = get_verse_o(ch1, ca=True)\n", " (bk2, ch2, vs2, ca2) = get_verse_o(ch2, ca=True)\n", " f.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk1,\n", " ch1,\n", " vs1,\n", " ca1,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " \"[chapter diff with {}](tool:{})\".format(lb2, fl),\n", " )\n", " )\n", " f.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk2,\n", " ch2,\n", " vs2,\n", " ca2,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " \"[chapter diff with {}](tool:{})\".format(lb1, fl),\n", " )\n", " )\n", " for (x, refs) in refs_compiled:\n", " (bk, ch, vs, ca) = get_verse(x, ca=True)\n", " f.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk,\n", " ch,\n", " vs,\n", " ca,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " refs,\n", " )\n", " )\n", " for (chunk, clique, fl) in clique_refs:\n", " (bk, ch, vs, ca) = get_verse(chunk, ca=True)\n", " f.write(\n", " sfields_fmt.format(\n", " VERSION,\n", " bk,\n", " ch,\n", " vs,\n", " ca,\n", " \"T\",\n", " \"\",\n", " CROSSREF_STATUS,\n", " CROSSREF_KEYWORD,\n", " \"[all variants (clique {})](tool:{})\".format(clique, fl),\n", " )\n", " )\n", "\n", " TF.info(\n", " \"CROSSREFS: Generated {} notes\".format(\n", " 1 + len(refs_compiled) + 2 * len(chapter_diffs) + len(clique_refs)\n", " )\n", " )" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "def crossrefs2shebanq():\n", " expr = SHEBANQ_MATRIX + (SHEBANQ_SIMILARITY,)\n", " do_experiment(*(expr + (True,)))\n", " get_crossrefs()\n", " compile_refs()\n", " get_chapter_diffs()\n", " get_clique_refs()\n", " generate_notes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Main\n", "\n", "In the cell below you can select the experiments you want to carry out.\n", "\n", "The previous cells contain just definitions and parameters.\n", "The next cell will do work.\n", "\n", "If none of the matrices and cliques have been computed before on the system where this runs, doing all experiments might take multiple hours (4-8)." 
] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[ ]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "reset_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "do_experiment(False, 'sentence', 'LCS', 60, False)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "do_all_experiments()" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "```\n", "do_all_experiments(no_fixed=True, only_object='chapter')\n", "crossrefs2shebanq()\n", "show_all_experiments()\n", "get_specific_crossrefs(False, 'verse', 'LCS', 60, 'crossrefs_lcs_db.txt')\n", "do_all_chunks()\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[ ]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "HTML(ecss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8. Overview of the similarities\n", "\n", "Here are the plots of two similarity matrices\n", "* with verses as chunks and SET as similarity method\n", "* with verses as chunks and LCS as similarity method\n", "\n", "Horizontally you see the degree of similarity from 0 to 100%, vertically the number of pairs that have that (rounded) similarity. This axis is logarithmic." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[ ]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "do_experiment(False, \"verse\", \"SET\", 60, False)\n", "distances = collections.Counter()\n", "for (x, d) in chunk_dist.items():\n", " distances[int(round(d))] += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "x = range(MATRIX_THRESHOLD, 101)\n", "fig = plt.figure(figsize=[15, 4])\n", "plt.plot(x, [math.log(max((1, distances[y]))) for y in x], \"b-\")\n", "plt.axis([MATRIX_THRESHOLD, 101, 0, 15])\n", "plt.xlabel(\"similarity as %\")\n", "plt.ylabel(\"log # similarities\")\n", "plt.xticks(x, x, rotation=\"vertical\")\n", "plt.margins(0.2)\n", "plt.subplots_adjust(bottom=0.15)\n", "plt.title(\"distances\")" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "In[ ]:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "do_experiment(False, \"verse\", \"LCS\", 60, False)\n", "distances = collections.Counter()\n", "for (x, d) in chunk_dist.items():\n", " distances[int(round(d))] += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "lines_to_next_cell": 2 }, "outputs": [], "source": [ "x = range(MATRIX_THRESHOLD, 101)\n", "fig = plt.figure(figsize=[15, 4])\n", "plt.plot(x, [math.log(max((1, distances[y]))) for y in x], \"b-\")\n", "plt.axis([MATRIX_THRESHOLD, 101, 0, 15])\n", "plt.xlabel(\"similarity as %\")\n", "plt.ylabel(\"log # similarities\")\n", "plt.xticks(x, x, rotation=\"vertical\")\n", "plt.margins(0.2)\n", "plt.subplots_adjust(bottom=0.15)\n", "plt.title(\"distances\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In[ ]:" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.0" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }