{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Unicode normalization\n",
    "\n",
    "Many complex Unicode characters can be expressed in more than one way. For example, an uppercase A with a superscript ring can be expressed as a single Unicode character U+212b or as a combination of a base uppercase A (U+0041) followed by a combining superscript ring (U+030a). The documents you need to collate may contain alternative representations that you would like to treat as identical for collation purposes. To do that, you can create a shadow \"n\" property in your pretokenized JSON (see Unit 5 of this workshop) and normalize the strings. Here’s how to do that."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To show that the two representations are different at the underlying byte level but look alike to a human, we create two variables, which we call `a` and `b`, and assign one representation to each, which we then print so that we can examine how they look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "a = \"\\u212b\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Å'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "b = \"\\u0041\\u030a\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Å'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we check these two values for equality, Python tells us that they are not equal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a == b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "We can use the unicodedata.normalize() function to normalize both variables and then compare them. When we do that, the normalized versions are equal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import unicodedata\n",
    "unicodedata.normalize('NFC',a) == unicodedata.normalize('NFC',b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We’ve performed this normalization by itself so that we can examine the results. In a CollateX context, you would incorporate Unicode normalization as part of the process of generating an \"n\" property for your pretokenized JSON tokens. See <http://unicode.org/reports/tr15/> for more information about Unicode normalization forms."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}