{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoding\n", "\n", "## Goals\n", "\n", " - A string is more than a sequence of bytes\n", " - A string is a couple (bytes, encoding)\n", " - Use unicode_literals in python2\n", " - Manage differently encoded filenames\n", " - A string is not a sequence of bytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modules" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import os.path\n", "import glob" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from os.path import isdir\n", "basedir = \"/tmp/course\"\n", "if not isdir(basedir):\n", " os.makedirs(basedir) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Song of Childhood\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "Als das Kind Kind war,\n", "ging es mit hängenden Armen,\n", "wollte der Bach sei ein Fluß,\n", "der Fluß sei ein Strom,\n", "und diese Pfütze das Meer.\n", "\n", "Als das Kind Kind war,\n", "wußte es nicht, daß es Kind war,\n", "alles war ihm beseelt,\n", "und alle Seelen waren eins.\n", "\n", "\n", "

\n", "\"When the child was a child,
\n", "characters were bytes, and
\n", "strings list of bytes.\"
\n", "

\n", "
\n", "When the child was a child\n", "It walked with its arms swinging,\n", "wanted the brook to be a river,\n", "the river to be a torrent,\n", "and this puddle to be the sea.\n", "\n", "When the child was a child,\n", "it didn’t know that it was a child,\n", "everything was soulful,\n", "and all souls were one.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoding is a map\n", "\n", "Encoding is a map between typographical characters and byte-sequences.\n", "\n", "Decoding is its reverse map.\n", "\n", "\n", "|char -> | utf8 | cp1252 | ascii |\n", "|-------------|--------|---|\n", "|y -> | [121] | [121] | [121] |\n", "|z -> | [122] | [122] | [122] |\n", "|{ -> | [123] | [123] | [123] |\n", "|¢ -> | [194, 162] | [162] | - |\n", "|£ -> | [194, 163] | [163] | - |\n", "|¤ -> | [194, 164] | [164] | - |\n", "|¥ -> | [194, 165] | [165] | - |\n", "|Ɓ -> | [198, 129] | - | - |\n", "|Ƃ -> | [198, 130] | - | - |\n", "|ƃ -> | [198, 131] | - | - |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Py3 doesn't need the 'u' prefix before the string.\n", "the_string = u\"S\\u00fcd\" # Sued\n", "print(the_string)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# the_string Sued can be encoded in different...\n", "in_utf8 = the_string.encode('utf-8')\n", "in_win = the_string.encode('cp1252')\n", "\n", "# ...byte-sequences\n", "assert type(in_utf8) == bytes " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now you can see the differences between \n", "print(repr(in_utf8))\n", "# and\n", "print(repr(in_win))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Decoding bytes using the wrong map...\n", "# ...gives Süd results\n", "print(in_utf8.decode('cp1252'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Filenames are actually binary data\n", "# we should be careful when our scripts read\n", "# eg from a vfat filesystem.\n", "\n", "# To make Py2 encoding-aware we must\n", "from __future__ import unicode_literals, print_function\n", "\n", "# Create 3 windows-encoded filenames in\n", "\n", "# using the provided function\n", "from course import create_espana\n", "create_espana(basedir)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Just list the newly created files\n", "# and check that they are not showing correctly (unless we have windows :D)\n", "!dir {basedir}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from glob import glob as ls \n", "#expands wildcards like ls\n", "\n", "# To avoid encoding issue like the following...\n", "files = ls(\"/tmp/course/*.txt\")\n", "\n", "#UnicodeDecodeError: 'ascii' codec can't decode\n", "# byte 0xe9 in position 5: # remember ñ in cp1252\n", "# ordinal not in range(128)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We must explicitly use bytes prefixing with \"b\"\n", "files = ls(b\"/tmp/course/*.txt\")\n", "# And the file names are shown with bytes.\n", "print(files)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Exercise: don't run this cell! \n", "# Which outcome do you expect from the following instruction?\n", "print('\\n'.join(files))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }