{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Unicode normalization\n", "\n", "Many Unicode characters can be expressed in more than one way. For example, an uppercase A with a ring above can be expressed as a single Unicode character (U+212B, the angstrom sign) or as a base uppercase A (U+0041) followed by a combining ring above (U+030A). The documents you need to collate may contain alternative representations that you would like to treat as identical for collation purposes. To do that, you can normalize the strings and store the result in a shadow \"n\" property in your pretokenized JSON (see Unit 5 of this workshop). Here’s how to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To show that the two representations are different at the underlying byte level but look alike to a human, we create two variables, `a` and `b`, assign one representation to each, and display each one so that we can examine how it looks." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = \"\\u212b\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Å'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "b = \"\\u0041\\u030a\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Å'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we check these two values for equality, Python tells us that they are not equal:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "False" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a == b" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We can use the `unicodedata.normalize()` function to convert both variables to the same normalization form (here, NFC) and then compare them. When we do that, the normalized versions are equal:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import unicodedata\n", "unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We’ve performed this normalization by itself so that we can examine the results. In a CollateX context, you would incorporate Unicode normalization as part of the process of generating an \"n\" property for your pretokenized JSON tokens. See Unicode Standard Annex #15 for more information about Unicode normalization forms." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }