{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced normalization\n",
    "\n",
    "* CollateX default matching\n",
    "* Why you may want to override it\n",
    "* How to override it\n",
    "\n",
    "## CollateX default matching\n",
    "\n",
    "* Exact string matching – Near matching\n",
    "* Tokenize by splitting on white space\n",
    "* Punctuation marks are individual tokens\n",
    "* No case normalization\n",
    "* No Unicode normalization\n",
    "\n",
    "## Sample normalization overrides\n",
    "* Case folding\n",
    "* Unicode normalization (precomposed characters)\n",
    "* Strip punctuation\n",
    "* Strip markup\n",
    "\n",
    "## Soundex\n",
    "\n",
    "* English-language surnames, 1918\n",
    "* Algorithm (simplified)\n",
    "    1. Retain first letter\n",
    "    1. Delete other vowels\n",
    "    1. Degeminate\n",
    "    1. Conflate other letters according to phonetic similarity (e.g., t/d = 3; m/n = 5)\n",
    "    1. Truncate or zero-pad to four characters\n",
    "* Examples\n",
    "    * *Birnbaum* B-651 (also ✓*Barenboim*; also ✗*Brumble*)\n",
    "\n",
    "## Soundex assumptions\n",
    "\n",
    "* More nuanced than generic edit distance\n",
    "    * Edit distance (Levenshtein distance): *deletion*, *insertion*, *substitution* (Damerau-Levenshtein: *transposition*)\n",
    "* Character differences are not all equivalent with respect to information load\n",
    "    * Consonants carry more information than vowels\n",
    "* Information load may be sensitive to position\n",
    "    * Beginning of word carries more information than end\n",
    "    * Especially true for lexical (not morphological) searching in inflected languages\n",
    "    \n",
    "## Adapting Soundex to Church Slavonic\n",
    "\n",
    "* Neutralize variant spellings of initial vowel\n",
    "    * оу,у,ꙋ=у\n",
    "    * ѡ,ꙍ,ѻ,о=о\n",
    "* Casefold, neutralize consonantal variants\n",
    "    * Not always one-to-one, e.g., щ = шт\n",
    "* Degeminate, delete other vowels, delete diacritics\n",
    "    * Keep two letters of two-letter words\n",
    "    * Higher information load\n",
    "* Other conflations?\n",
    "    * Knowledge based vs machine learning\n",
    "* Expand abbreviations? –  б҃га, бг҃а, б҃а = бога (бг)\n",
    "    * Truncate\n",
    "    * Zero-pad\n",
    "    * To what length?\n",
    "\n",
    "## Two types of normalization\n",
    "\n",
    "### Collation\n",
    "\n",
    "* Find alignment points\n",
    "* Coarse adjustments\n",
    "* No harm in conflating, e.g., imperfect and aorist or infinitive and supine\n",
    "\n",
    "### Evaluation\n",
    "\n",
    "* Alignment points are already known\n",
    "* Finer comparisons\n",
    "* Many need to distinguish on the basis of small details\n",
    "\n",
    "## Collation after Soundex\n",
    "\n",
    "* Greatly improved results\n",
    "* Utilize forced matches\n",
    "    * A B C\n",
    "    * A D C\n",
    "* Misses\n",
    "    * Gap in alignment (no forced match)\n",
    "    * Imperfect match\n",
    "        * фраки ~ фраци\n",
    "    * CollateX recognizes only perfect matches\n",
    "    * Unable to recognize closest match (but see *near matching*)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}