{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced normalization\n", "\n", "* CollateX default matching\n", "* Why you may want to override it\n", "* How to override it\n", "\n", "## CollateX default matching\n", "\n", "* Exact string matching (near matching is optional)\n", "* Tokenize by splitting on whitespace\n", "* Punctuation marks are individual tokens\n", "* No case normalization\n", "* No Unicode normalization\n", "\n", "## Sample normalization overrides\n", "\n", "* Case folding\n", "* Unicode normalization (precomposed characters)\n", "* Strip punctuation\n", "* Strip markup\n", "\n", "## Soundex\n", "\n", "* English-language surnames, 1918\n", "* Algorithm (simplified)\n", "    1. Retain first letter\n", "    1. Delete other vowels\n", "    1. Degeminate\n", "    1. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)\n", "    1. Truncate or zero-pad to four characters\n", "* Examples\n", "    * *Birnbaum* B-651 (also ✓*Barenboim*; also ✗*Brumble*)\n", "\n", "## Soundex assumptions\n", "\n", "* More nuanced than generic edit distance\n", "    * Edit distance (Levenshtein distance): *deletion*, *insertion*, *substitution* (Damerau-Levenshtein: *transposition*)\n", "* Character differences are not all equivalent with respect to information load\n", "    * Consonants carry more information than vowels\n", "* Information load may be sensitive to position\n", "    * Beginning of word carries more information than end\n", "    * Especially true for lexical (not morphological) searching in inflected languages\n", "\n", "## Adapting Soundex to Church Slavonic\n", "\n", "* Neutralize variant spellings of initial vowel\n", "    * оу,у,ꙋ=у\n", "    * ѡ,ꙍ,ѻ,о=о\n", "* Casefold, neutralize consonantal variants\n", "    * Not always one-to-one, e.g., щ = шт\n", "* Degeminate, delete other vowels, delete diacritics\n", "    * Keep two letters of two-letter words\n", "    * Higher information load\n", "* Other conflations?\n", "    * Knowledge-based vs. machine learning\n", "* Expand abbreviations? – б҃га, бг҃а, б҃а = бога (бг)\n", "    * Truncate\n", "    * Zero-pad\n", "    * To what length?\n", "\n", "## Two types of normalization\n", "\n", "### Collation\n", "\n", "* Find alignment points\n", "* Coarse adjustments\n", "* No harm in conflating, e.g., imperfect and aorist or infinitive and supine\n", "\n", "### Evaluation\n", "\n", "* Alignment points are already known\n", "* Finer comparisons\n", "* May need to distinguish on the basis of small details\n", "\n", "## Collation after Soundex\n", "\n", "* Greatly improved results\n", "* Utilize forced matches\n", "    * A B C\n", "    * A D C\n", "* Misses\n", "    * Gap in alignment (no forced match)\n", "    * Imperfect match\n", "    * фраки ~ фраци\n", "    * CollateX recognizes only perfect matches\n", "    * Unable to recognize closest match (but see *near matching*)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }