{ "metadata": { "name": "", "signature": "sha256:b3444aabd58043b4d0321ef51c48535e1d2cf95aca56222cfcf18492a4f077e4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Yet Another Python Encoding Tutorial" ] }, { "cell_type": "heading", "level": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Guillermo Moncecchi" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "This notebook shows how to manage text encodings with Python. It is mostly based in the [Unicode HOWTO](http://docs.python.org/2/howto/unicode), and the [codecs](http://docs.python.org/2/library/codecs.html) library module from the Python documentation. I have decided to start with these notes because handling encodings seems a pretty difficult task if you do not understand what Unicode is and how it works and how the are different ways of encode the same string. " ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Characters" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Characters are the smallest units of a text. Usually, we think about letters, but everything that can have its own glyph (i.e., drawing), such as a number, a math symbol or a greek letter, is a character. A _character table_ is a list of numerical values for a certain list of characters. The most ancient character table is the well known ASCII table. ASCI defined numeric codes for 128 characters (from 0 to 127), mostly including English letters and numbers, but not (for example) accented characters. ASCII characters have an important feature: they use only 7 bits, so they can be always represented by a byte. In fact, another common encoding, Latin-1 (or ISO-8859-1), uses 8 bits, encoding 256 characters using just a byte. The first 128 characters of Latin-1 are the same as characters of ASCII. \n", "\n", "Python 2 strings are ASCII strings" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "print sys.getdefaultencoding()\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ascii\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you write the following code in a Python script (let's call it encoding_test.py)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "s='\u20ac';print s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... you will get an error like this:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "SyntaxError: Non-ASCII character '\\xe2' in file encoding_test.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... because you are trying to specify a non-ascii character in a Python 2 string (in the next section we will see how to fix this). " ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you have a text, and want to display it in your system terminal (i.e. draw it's glyph), or save it in a IO device. You will need to represent each character in your text using bits grouped in bytes (because that is all what computers can understand). 
This is (almost) trivial for ASCII or Latin-1: just represent each character with its numerical code, using the byte's 8 bits. But what should we do with characters beyond Latin-1? There is no option: we must assign them new codes, and use more than one byte to represent them. The different ways to represent characters using bytes are called _encodings_ (I will not delve here into _how_ they encode each character; what I want to show is how Python manages the encoding).\n", "\n", "The first thing we must understand is that every text document (including Python source files!) is encoded using a certain encoding. _You cannot know the encoding for certain by inspecting the file contents_, simply because the encoding is a way to transform characters to bytes... the same file will look different using different encodings. However, there are some commands (on Linux, for example, the enca command) that try to guess it (using \"a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings\"). So, generally speaking, you had better know which encoding the file you want to read uses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if this notebook correctly shows the \u20ac sign when we print a variable value, your default system locale is using UTF-8 (which is a popular encoding). Let's verify it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import locale\n", "print locale.getpreferredencoding()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "UTF-8\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I have created four different text files, saving each one (using my text editor) with a different encoding. Let's see how this notebook displays them. First, a simple ASCII file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat ../data/ASCII_file.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is a plain ascii file. It does not matter which encoding it uses." ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "ASCII files will always look OK (no matter which encoding your terminal or editor is using), because (as far as I know) every encoding uses the same codes for the first 128 characters (the ASCII codes). Now, let's try with Latin-1:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat ../data/latin_1_file.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is a latin-1 encoded file. \r\n", "It includes some accented characters: educaci\ufffdn alegr\ufffda c\ufffdmara. \r\n", "La tercera oraci\ufffdn est\ufffd en espa\ufffdol.\r\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start to have some problems. Since our terminal uses UTF-8, and the file encodes accented symbols with Latin-1, the terminal cannot display them, _because Latin-1 and UTF-8 have different encodings for characters between 128 and 255_. If we encode the file using the Mac OS Roman encoding (the encoding classic Mac OS used to have), we will have similar problems:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat ../data/mac_os_roman_file.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is an Mac OS Roman. 
Some accented characters: c\ufffdmara alegr\ufffda ilusi\ufffdn" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if we display a utf-8 encoded file, we have no problems:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat ../data/utf-8_file.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is an UTF-8. \r\n", "It includes a pair of japanese characters: \u6771\u4eac (Tokyo). \r\n", "The euro sign is this: \u20ac. \r\n", "Some accented characters: c\u00e1mara alegr\u00eda ilusi\u00f3n.\r\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that this happens only because the terminal encoding matches the file encoding; there is no 'correct' encoding! UTF-8 is becoming more and more popular (and, since PEP 3120, Python 3 source files are utf-8 encoded by default), but remember that it is just another encoding. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How could we include (for example) Chinese characters in Python source? [PEP-263](http://legacy.python.org/dev/peps/pep-0263/) is the answer: you should specify an encoding in the first (or second) line of your source file: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "# coding:utf-8\n", "tokyo='\u6771\u4eac'\n", "print tokyo\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u6771\u4eac\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python is still storing 8-bit strings, but it converts what you wrote to bytes (using the encoding you specified). Remember, declaring the encoding does not mean that your source file is actually encoded that way by your OS: you should save your file in the declared encoding, using (for example) your text editor. " ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Unicode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember this: Unicode is *not* an encoding. Unicode is a character table, just like ASCII. Unicode originally used 16 bits for each character code, i.e. 2^16 = 65,536 distinct values (modern versions of the standard include many more). Python includes a library, unicodedata, to display information about Unicode characters and their numeric codes (called code points, and usually written in base 16). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "import unicodedata\n", "u=unichr(8364)\n", "print u,ord(u),unicodedata.category(u),unicodedata.name(u)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u20ac 8364 Sc EURO SIGN\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important (and confusing...) characteristic of Unicode is that the first 128 character codes are the same as those of ASCII. 
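We can verify this overlap directly (a small sketch; unicodedata was imported in the previous cell, and 65 is the code of 'A' in both tables):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# The first 128 Unicode code points coincide with the ASCII codes\n", "print ord('A'), ord(u'A')             # both print 65\n", "print unichr(65), unichr(65) == u'A'  # A True\n", "print unicodedata.name(u'A')          # LATIN CAPITAL LETTER A" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "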
To tell Python that you are specifying a Unicode string, precede the string with a u, and specify any non-ascii character using its Unicode code point value: \\u followed by 4 hex digits (or \\U followed by 8 hex digits):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "u=u'This string includes an \\u20AC sign'\n", "print u" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This string includes an \u20ac sign\n" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "People often confuse Unicode and UTF-8 (just take a look at Stack Overflow), but they are different things: UTF-8 (as we said) is an _encoding_, while Unicode is a list of codes representing characters. When you build a Unicode string (exactly as you did with regular strings), Python allows you to write it using non-ascii characters if you have declared a source encoding:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# coding:utf-8\n", "u=u'This string includes an \u20ac sign'\n", "print u" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This string includes an \u20ac sign\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "... but this does NOT change its internal representation. What _does_ happen is that utf-8 can encode _every_ Unicode character, something that latin-1 obviously cannot (why?)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The recommendation for handling character sequences in Python is to always use Unicode strings for processing, fixing a standard code (that is, after all, what Unicode was invented for). The unicode constructor allows you to build a unicode value from a string. 
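As a small sketch before the examples (the bytes '\\xc3\\xa9' below are the utf-8 encoding of u'caf\\u00e9'), note that passing an encoding to the constructor is equivalent to calling the string's decode() method, which we will meet again in the Input/Output section:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# A small sketch: unicode(s, enc) is equivalent to s.decode(enc)\n", "b = 'caf\\xc3\\xa9'         # the utf-8 bytes of u'caf\\u00e9'\n", "u1 = unicode(b, 'utf-8')  # via the constructor\n", "u2 = b.decode('utf-8')    # via the str.decode() method\n", "print u1 == u2, type(u1), type(u2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "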
If you specify no encoding, the constructor will assume you are passing an ascii-encoded string (and it will fail miserably when it finds a character whose code is beyond 127):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "u=unicode('This string is ascii')\n", "print u, type(u)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This string is ascii <type 'unicode'>\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "u=unicode('This string contains \\xE2\\x82\\xAC', encoding='utf-8')\n", "print u,type(u)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This string contains \u20ac <type 'unicode'>\n" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "u=unicode('This string contains \\xE2\\x82\\xAC')\n", "print u,type(u)\n" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "UnicodeDecodeError", "evalue": "'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mu\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0municode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'This string contains \\xE2\\x82\\xAC'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mu\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mu\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "u=unicode('This string contains \u20ac')\n", "print u,type(u)\n" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "UnicodeDecodeError", "evalue": "'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mu\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0municode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'This string contains \u20ac'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mu\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mu\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xe2 in position 21: ordinal not in range(128)" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a very common error, and you should really be sure that you understand what is happening. In the first example, we simply pass an 8-bit string, and Python creates a unicode value, assuming ascii. 
In the second, we specify the Euro sign using hex values, and tell Python that the 8-bit string we passed is actually encoding something using utf-8. In the third case, Python tries to create the Unicode string, but when it comes to the character '\\xe2' (that is, a decimal value of 226), it discovers that this is an invalid ascii value, and fails. This is *completely independent* of the encoding you are using for your source file. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most string functions still work if you pass Unicode sequences instead of strings. The advantage is that, once you have decoded your text to Unicode, you can forget about bytes and encodings. If you ask for the length of a sequence, you are asking for the number of Unicode code points, which is encoding-independent (something that does not happen with bytes, since characters beyond the ascii range may need more than one byte each). So, the general advice is: convert your strings to Unicode before working with them. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "UTF-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From [Wikipedia](http://en.wikipedia.org/wiki/UTF-8):\n", "\n", " \n", "UTF-8 (UCS Transformation Format\u20148-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.\n", "\n", "UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an \"octet\" in the Unicode Standard). Code points with lower numerical values (i.e. earlier code positions in the Unicode character set, which tend to occur more frequently) are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.\n", "\n", "\n", "\n", "Some features of the UTF-8 encoding:\n", "\n", "- One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0.\n", "- Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position.\n", "- The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence, so that the length of the sequence can be determined without examining the continuation bytes.\n", "- The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the lead byte, lower-order bits in succeeding continuation bytes. The number of bytes in the encoding is the minimum required to hold all the significant bits of the code point.\n", "- Single bytes, leading bytes, and continuation bytes do not share values. 
This makes the scheme self-synchronizing, allowing the start of a character to be found by backing up at most five bytes (three bytes in actual UTF\u20118).\n", "" ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Input/Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most problems with encodings arise when you do input/output. The purpose of this section is to explain what happens when you read/write strings from a file, and how to avoid strange encoding errors. Let me cite the Unicode HOWTO:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It\u2019s possible to do all the work yourself: open a file, read an 8-bit string from it, and convert the string with unicode(str, encoding). However, the manual approach is not recommended._\n", "\n", "_One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1K or 4K), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk._\n", "\n", "_To read bytes from a file and convert them to Unicode characters, the **codecs** module implements an open() function that returns a file-like object that assumes the file\u2019s contents are in a specified encoding and accepts Unicode parameters for methods such as .read() and .write()_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by reading an ascii file, using the standard open() function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ascii_file =open('../data/ASCII_file.txt')\n", "for line in ascii_file:\n", "    print line" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is a plain ascii file. It does not matter which encoding it uses.\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try to open the utf-8 file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "utf_8_file =open('../data/utf-8_file.txt')\n", "for line in utf_8_file:\n", "    print line, type(line)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is an UTF-8. \n", " <type 'str'>\n", "It includes a pair of japanese characters: \u6771\u4eac (Tokyo). \n", " <type 'str'>\n", "The euro sign is this: \u20ac. \n", " <type 'str'>\n", "Some accented characters: c\u00e1mara alegr\u00eda ilusi\u00f3n.\n", " <type 'str'>\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First observation: you _can_ use open() on encoded files; open() just considers them a stream of bytes. If your terminal uses the right encoding, the strings will be displayed correctly. But beware: you are working with bytes, not with Unicode, which is not recommended, for the reasons previously mentioned. Let's see what happens if we try to open and display the Latin-1 file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "latin_1_file =open('../data/latin_1_file.txt')\n", "for line in latin_1_file:\n", "    print line" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is a latin-1 encoded file. 
\n", "\n", "It includes some accented characters: educaci\ufffdn alegr\ufffda c\ufffdmara. \n", "\n", "La tercera oraci\ufffdn est\ufffd en espa\ufffdol.\n", "\n" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not working. Actually, it _is_ working: it just reads bytes, and display them... using utf-8, not latin-1. We can try to decode each line, specifying the encoding, and converting them to Unicode: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "latin_1_file =open('../data/latin_1_file.txt')\n", "for line in latin_1_file:\n", " print line.decode('latin-1')\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is a latin-1 encoded file. \n", "\n", "It includes some accented characters: educaci\u00f3n alegr\u00ed\u00ada c\u00e1mara. \n", "\n", "La tercera oraci\u00f3n est\u00e1 en espa\u00f1ol.\n", "\n" ] } ], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try to read the files into Unicode, using codecs.open instead of the standard open function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import codecs\n", "utf_8_file =codecs.open('../data/UTF-8_file.txt', encoding='utf-8')\n", "for line in utf_8_file:\n", " print type(line),line\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " This is an UTF-8. \n", "\n", " It includes a pair of japanese characters: \u6771\u4eac (Tokyo). \n", "\n", " The euro sign is this: \u20ac. \n", "\n", " Some accented characters: c\u00e1mara alegr\u00eda ilusi\u00f3n.\n", "\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that now the type has changed. We are working with Unicode sequences; that is, we have a format-independent representations. That is what we wanted! The codecs library includes and encode() function that (you guessed it, didn't you?) that allows to change encoding. For example, let's first encode an Unicode string using the Unicode codes for accented letters (note that, specifiying the string this way, we do not have to worry about the encoding): " ] }, { "cell_type": "code", "collapsed": false, "input": [ "s=u'Alegr\\u00EDa c\\u00E1mara ilusi\\u00F3n'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 48 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to show it, Python will use the encoding for our system (in our case UTF-8):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print s, type(s)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Alegr\u00eda c\u00e1mara ilusi\u00f3n \n" ] } ], "prompt_number": 49 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If, for some reason, we want to encode it using Latin-1, we just use codecs.encode (which would not look very well in our display):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "s1=s.encode('latin-1')\n", "print s1,type(s1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Alegr\ufffda c\ufffdmara ilusi\ufffdn \n" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note that encode returns an 8-bit string, while decode returns an Unicode string). We can try to encode it using Python default encoding (ascii)... 
and fail:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print s.encode()" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'ascii' codec can't encode character u'\\xed' in position 5: ordinal not in range(128)", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\xed' in position 5: ordinal not in range(128)" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By now, you should understand why this happened. There is no way to represent '\u00e1' using ascii, simply because it is not in its code list. This happens with every encoding pair. For example, let's try to encode the kanji for Tokyo using Latin-1:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# coding:utf-8\n", "tokyo=u'This is \u6771\u4eac, boy!'\n", "print tokyo.encode('latin-1')\n" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mtokyo\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34mu'This is \u6771\u4eac, boy!'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mtokyo\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'latin-1'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most encoding problems are reduced to this: _you cannot encode what you cannot encode_. If a symbol is not present in an encoding, you simply cannot encode it. The encode() function offers us some ways to manage this, through the errors parameter. The default value ('strict') produces the previous error when it can't encode a character. 
But we can specify 'ignore' to simply skip those characters it cannot manage, 'replace' to use a replacement character, 'xmlcharrefreplace' to use an XML character reference as replacement, or 'backslashreplace' to replace with backslashed escape sequences:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print tokyo.encode('latin-1','ignore')\n", "print tokyo.encode('latin-1','replace')\n", "\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is , boy!\n", "This is ??, boy!\n" ] } ], "prompt_number": 36 }, { "cell_type": "code", "collapsed": false, "input": [ "print tokyo.encode('latin-1','backslashreplace')\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "This is \\u6771\\u4eac, boy!\n" ] } ], "prompt_number": 37 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A real-world example: tagging Spanish texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, I will try to use the [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) to parse some text (using the [treetagger Python module](https://github.com/miotto/treetagger-python), based on the TaggerI class from NLTK):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# You must have NLTK installed, treetagger installed and the TREETAGGER_HOME env variable set for this to work\n", "# export TREETAGGER_HOME='/path/to/your/TreeTagger/'\n", "#\n", "from treetagger import TreeTagger\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking the documentation, we find that the TreeTagger can tag latin-1 and utf-8 texts. We must also specify the language, but this has nothing to do with the encoding, only with how the tagger was trained... First, let's tag an English text, to check that everything is working (the English version of the TreeTagger only accepts latin-1 encoded strings). " ] }, { "cell_type": "code", "collapsed": false, "input": [ "tt=TreeTagger(language='english',encoding='latin-1')\n", "tagged_sent=tt.tag('What is the airspeed of an unladen swallow? And what about the \u20ac sign?')\n", "print tagged_sent\n", "print '\\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[u'What', u'WP', u'What'], [u'is', u'VBZ', u'be'], [u'the', u'DT', u'the'], [u'airspeed', u'NN', u'airspeed'], [u'of', u'IN', u'of'], [u'an', u'DT', u'an'], [u'unladen', u'JJ', u'<unknown>'], [u'swallow', u'NN', u'swallow'], [u'?', u'SENT', u'?'], [u'And', u'CC', u'and'], [u'what', u'WP', u'what'], [u'about', u'IN', u'about'], [u'the', u'DT', u'the'], [u'\\u20ac', u'JJ', u'<unknown>'], [u'sign', u'NN', u'sign'], [u'?', u'SENT', u'?']]\n", "\n", "Readable version:What/WP is/VBZ the/DT airspeed/NN of/IN an/DT unladen/JJ swallow/NN ?/SENT And/CC what/WP about/IN the/DT \u20ac/JJ sign/NN ?/SENT\n" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see in detail what happened here. First, tt.tag() received a Python string (which, given we are using Python 2, is a plain byte string). Even when we specified (using our environment encoding, i.e. 
utf-8) a '\u20ac' sign, it was passed as 3 bytes (the 3 bytes of its utf-8 representation):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print len('\u20ac')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the difference between '$' (a symbol from the ascii table) and '\u20ac' (by now, you should anticipate the answer):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print len('What is the airspeed of an unladen swallow? And what about the \u20ac sign?')\n", "print len('What is the airspeed of an unladen swallow? And what about the $ sign?')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "72\n", "70\n" ] } ], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "After that, the byte stream is sent to the TreeTagger, which analyzes it and returns its results. This tagger does not know about Unicode: it analyzes words (i.e. byte streams) and returns their tags, depending on its training corpus (which may itself use a different encoding). The Python module gets this output and converts it to Unicode strings, _using utf-8_ (I had to check the module source code to find that out). That is why the Euro sign is recovered in the output! But keep in mind that when the tagger saw the Euro symbol, it did not find a Unicode character, but three bytes. Now, let's do exactly the same, but explicitly telling the tagger that we have written a Unicode string:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tagged_sent=tt.tag(u'What is the airspeed of an unladen swallow? And what about the \u20ac sign?')\n", "print tagged_sent\n", "print '\\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])\n" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'latin-1' codec can't encode character u'\\u20ac' in position 63: ordinal not in range(256)", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtagged_sent\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtag\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mu'What is the airspeed of an unladen swallow? 
And what about the \u20ac sign?'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mtagged_sent\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m'\\nReadable version:'\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;34m' '\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;34m'/'\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0mpos\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mpos\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mlemma\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtagged_sent\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/guillermo/Dropbox/fing/work/datascience/src/treetagger.pyc\u001b[0m in \u001b[0;36mtag\u001b[0;34m(self, sentences)\u001b[0m\n\u001b[1;32m 136\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 137\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_input\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0municode\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mencoding\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 138\u001b[0;31m \u001b[0m_input\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_input\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mencoding\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 139\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 140\u001b[0m \u001b[0;31m# Run the tagger and get the output\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'latin-1' codec can't encode character u'\\u20ac' in position 63: ordinal not in range(256)" ] } ], "prompt_number": 32 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops. Now, before sending the input to the tagger, the Python module tries to encode it (it needs to, since the tagger is accessed through a system pipe and thus cannot understand Unicode), and it uses the specified encoding (latin-1). But the Euro sign is not part of the latin-1 character table, and the encoding fails. Can we circumvent this? Not with this module: we would need a utf-8 version of the English tagger, something we do not have. Let's see what happens with the utf-8 version of the Spanish tagger:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tt=TreeTagger(language='spanish',encoding='utf8')\n", "tagged_sent=tt.tag(u'\u00bfPodremos taggear esto? \u00bfY qu\u00e9 pasa con el signo de \u20ac? ')\n", "print '\\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])\n", "\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Readable version:\u00bf/FS Podremos/VLfin taggear/VLinf esto/DM ?/FS \u00bf/FS Y/CC qu\u00e9/INT pasa/VLfin con/PREP el/ART signo/NC de/PREP \u20ac/NC ?/FS\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now everything is working. The tagger is receiving utf-8 encoded strings, and its results are converted back to Unicode. 
It does not matter if we do not specify that the input is Unicode, because our system uses utf-8, and the results are the same:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tt=TreeTagger(language='spanish',encoding='utf8')\n", "tagged_sent=tt.tag('\u00bfPodremos taggear esto? \u00bfY qu\u00e9 pasa con el signo de \u20ac? ')\n", "print '\\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Readable version:\u00bf/FS Podremos/VLfin taggear/VLinf esto/DM ?/FS \u00bf/FS Y/CC qu\u00e9/INT pasa/VLfin con/PREP el/ART signo/NC de/PREP \u20ac/NC ?/FS\n" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last test: let's tag a Bulgarian sentence (we can only use utf-8 for this, since its character codes are not within latin-1):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# There is a bug in the Python treetagger module:\n", "# it assumes that when a language has only one encoding, that encoding is latin-1.\n", "# Bulgarian, for example, only allows utf-8,\n", "# so we have to tell the module that we are using latin-1\n", "tt=TreeTagger(language='bulgarian',encoding='latin-1')\n", "tagged_sent=tt.tag('\u0422\u043e\u0432\u0430 \u0435 \u043c\u043e\u044f\u0442 \u0434\u043e\u043c')\n", "print '\\nReadable version:'+' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])\n", "\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Readable version:\u0422\u043e\u0432\u0430/Pde-os-n \u0435/Vxitf-r3s \u043c\u043e\u044f\u0442/Psol-s1mf \u0434\u043e\u043c/Ncmsi\n" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try to read sentences from differently encoded files, tag them, and show their utf-8-encoded versions in our terminal. Let's first read the latin-1 file, using the codecs package, and tag its third sentence (written in Spanish):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import codecs\n", "tt=TreeTagger(language='spanish',encoding='utf8')\n", "f = codecs.open('../data/latin_1_file.txt', encoding='latin-1')\n", "sents=f.readlines()\n", "spanish_sent=sents[2]\n", "tagged_sent=tt.tag(spanish_sent)\n", "print type(spanish_sent),spanish_sent, ' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<type 'unicode'> La tercera oraci\u00f3n est\u00e1 en espa\u00f1ol.\n", "La/ART tercera/ORD oraci\u00f3n/NC est\u00e1/VEfin en/PREP espa\u00f1ol/NC ./FS\n" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's process the utf-8 file the same way (of course, it is nonsense to tag English tokens with a Spanish tagger, but we want to check that the utf-8 reading works as expected):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "f = codecs.open('../data/utf-8_file.txt', encoding='utf-8')\n", "sents=f.readlines()\n", "for sent in sents:\n", "    tagged_sent=tt.tag(sent)\n", "    print '\\n',type(sent),sent, ' '.join([word+'/'+pos for (word,pos,lemma) in tagged_sent])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", " <type 'unicode'> This is an UTF-8. 
\n", "This/NP is/PE an/PE UTF-8/NC ./FS\n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " It includes a pair of japanese characters: \u6771\u4eac (Tokyo). \n", "It/NC includes/VLfin a/PREP pair/NC of/PE japanese/NC characters/PE :/COLON \u6771\u4eac/NC (/LP Tokyo/NP )/RP ./FS\n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " The euro sign is this: \u20ac. \n", "The/PE euro/NC sign/PE is/PE this/NC :/COLON \u20ac/NC ./FS\n", "\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " Some accented characters: c\u00e1mara alegr\u00eda ilusi\u00f3n.\n", "Some/NP accented/VLfin characters/PE :/COLON c\u00e1mara/NC alegr\u00eda/NC ilusi\u00f3n/NC ./FS\n" ] } ], "prompt_number": 37 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us cite part of the [What\u2019s New In Python 3.0](http://docs.python.org/3.0/whatsnew/3.0.html). If we did well, you should understand everything\n", "\n", "\n", "- Text Vs. Data Instead Of Unicode Vs. 8-bit\n", "\n", "Everything you thought you knew about binary data and Unicode has changed.\n", "\n", "- Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.\n", "- As a consequence of this change in philosophy, pretty much all code that uses Unicode, encodings or binary data most likely has to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using unicode for all unencoded text, and str for binary or encoded data only. Then the 2to3 tool will do most of the work for you.\n", "- You can no longer use u\"...\" literals for Unicode text. However, you must use b\"...\" literals for binary data.\n", "- All backslashes in raw string literals are interpreted literally. This means that '\\U' and '\\u' escapes in raw strings are not treated specially. For example, r'\\u20ac' is a string of 6 characters in Python 3.0, whereas in 2.6, ur'\\u20ac' was the single \u201ceuro\u201d character. (Of course, this change only affects raw string literals; the euro character is '\\u20ac' in Python 3.0.)\n", "- Files opened as text files (still the default mode for open()) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a b in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. 
There is a platform-dependent default encoding, which on Unixy platforms can be set with the LANG environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the codecs module.\n", "- PEP 3120: The default source encoding is now UTF-8.\n", "" ] } ], "metadata": {} } ] }