{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Near matching\n", "\n", "**The example below works with 2.1 (the current release), but not 2.0**\n", "\n", "## What is near matching and why do we use it?\n", "\n", "*Near matching* is another term for *fuzzy matching*, that is, is based on the idea that two items (such as two word tokens in a collation) should sometimes be considered matching even when they are not *string equal* (that is, not identical in every character). More precisely, near matching is a strategy for finding the closest match in situations where they not be an exact match. \n", "\n", "Consider the following example from the _Rus′ primary chronicle_:\n", "\n", "\n", "\n", "The last column contains slightly differing forms of **fraci**, but the second witness, _Tro_, reads **fraki**. Normalization, including Soundex takes care of the slight variation during Gothenburg stage 2, but we don’t want to merge **c** and **k** globally because the difference between them is usually significant.\n", "\n", "When CollateX cannot find an exact match and there is more than one possible alignment for a token, it defaults to placing the token to the left. This means that without near matching, **fraki**, which does match perfectly either **i** or **fraci**, would be misaligned. With near matching, though, CollateX can recognize that **fraci** is more like **fraki** than it is like **i**, and thus place it in the correct (right) column.\n", "\n", "## How does near matching work in CollateX?\n", "\n", "The way near matching currently works in CollateX is that if the user has turned it on (it is off by default), after the *alignment* stage has been completed, the system looks for situations that cannot be resolved solely by exact matching, that is, entirely at the alignment stage. Those situations must show both of the following properties:\n", "\n", "1. **Different numbers of tokens:** Between two blocks (vertical sets of tokens) that are firmly aligned there is an unequal number of tokens in the witnesses. In the example above, the alignment of “The” and “koala” is clear because both witnesses have the identical reading (that is, they are complete vertical blocks), but in one witness there is one token between them and in the other witness there are two tokens. If, on the other hand, the two witnesses read “The gray koala” and “The grey koala”, although the middle tokens don’t match, there’s no ambiguity because the alignment is forced: since “gray” and “grey” are sandwiched between perfect matches before and after, they can only be aligned with each other, so there is no need for near matching.\n", "2. **No exact match:** The tokens in the shorter witnesses do not have an exact string match in the longer witnesses. If they did have an exact match, that would have been found at the *alignment* stage and there would be no need for near matching. For example, if the two witnesses read “The grey koala” and “The big grey koala”, although there are three tokens in the first witness and four in the second, each token in the first has an exact string match in the second, which means that it can be aligned at the *alignment* stage.\n", "\n", "If and only if both of those conditions are met, CollateX compares the floating token in the shorter witness (“gray” in the example above) to all possible matches (“big” and “grey” in this example) and calculates the nearest match using a measure called *edit distance* or *Levenshtein distance* (see https://en.wikipedia.org/wiki/Edit_distance for more information). CollateX calculates the edit distance between the floating token and the tokens in the other witnesses at all of the locations where it could be placed, and determines the best placement. In a tradition with a large number of witnesses and large gaps, the number of comparisons grows quickly, which means that you don’t want to calculate edit distance except where you need to. Performing computationally inexpensive exact string matching first (in the *alignment* stage) and then calculating the more expensive edit distance (in the *analysis* stage) only where *alignment* has failed to give a satisfactory answer reduces the amount of computation required." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+-----+------+------+-------+\n", "| A | The | gray | - | koala |\n", "| B | The | big | grey | koala |\n", "+---+-----+------+------+-------+\n" ] } ], "source": [ "from collatex import *\n", "collation = Collation()\n", "collation.add_plain_witness(\"A\", \"The gray koala\")\n", "collation.add_plain_witness(\"B\", \"The big grey koala\")\n", "alignment_table = collate(collation, segmentation=False)\n", "print(alignment_table)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+-----+-----+------+-------+\n", "| A | The | - | gray | koala |\n", "| B | The | big | grey | koala |\n", "+---+-----+-----+------+-------+\n" ] } ], "source": [ "from collatex import *\n", "collation = Collation()\n", "collation.add_plain_witness(\"A\", \"The gray koala\")\n", "collation.add_plain_witness(\"B\", \"The big grey koala\")\n", "alignment_table = collate(collation, segmentation=False, near_match=True)\n", "print(alignment_table)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }