{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Various things can go wrong when you import your data into Pandas. Some of these are immediately obvious; others only appear later, in confusing forms.\n", "\n", "This page covers one common problem when loading data into Pandas --- text encoding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas and encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.set_option('mode.chained_assignment','raise')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the following annoying situation. You can download the data file from [imdblet_latin.csv](https://matthew-brett.github.io/cfd2019/data/imdblet_latin.csv)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "films = pd.read_csv('imdblet_latin.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next sections are about why this happens, and therefore, how to fix it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text encoding\n", "\n", "When the computer stores text in memory, or on disk, it must represent the\n", "characters in the text with numbers, because numbers are the computer's basic\n", "units of storage.\n", "\n", "The traditional unit of memory size, or disk size, is the\n", "[byte](https://en.wikipedia.org/wiki/Byte). Nowadays, the term byte means a\n", "single number that can take any value between 0 through 255. Specifically, a\n", "byte is a binary number with 8 binary digits, so it can store $2^8 = 256$\n", "different values --- 0 through 255.\n", "\n", "We can think of everything that the computer stores, in memory or on disk, as\n", "bytes --- units of information in memory, represented as numbers.\n", "\n", "This is also true for text. For example, here is a short piece of text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A short piece of text\n", "name = 'Pandas'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Somewhere in the computer's memory, Python has recorded \"Pandas\" as a series of\n", "bytes, in a format that it understands.\n", "\n", "When the computer writes this information into a file, it has to decide how to\n", "convert its own version of the text \"Pandas\" into bytes that other programs\n", "will understand. That is, it needs to convert its own format into a standard\n", "sequence of numbers (bytes) that other programs will recognize as the text\n", "\"Pandas\".\n", "\n", "This process of converting from Python's own format to a standard sequence of\n", "bytes, is called *encoding*. Whenever Python --- or any other program ---\n", "writes text to a file, it has to decide how to *encode* that text as a sequence\n", "of bytes.\n", "\n", "There are various standard ways of encoding text as numbers. One very common\n", "encoding is called [8-bit Unicode Transformation\n", "Format](https://en.wikipedia.org/wiki/UTF-8) or \"UTF-8\" for short. Almost all\n", "web page files use this format. Your web browser knows how to translate the\n", "numbers in this format into text to show on screen.\n", "\n", "We can see that process in memory, in Python, like this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the text in \"name\" into bytes.\n", "name_as_utf8_bytes = name.encode('utf-8')\n", "# Show the bytes as numbers\n", "list(name_as_utf8_bytes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the UTF-8 coding scheme, the number 80 stands for the character 'P', 97\n", "stands for 'a', and so on. Notice that for these standard English alphabet\n", "characters, UTF-8 stores each character with a single byte (80 for 'P' , 97 for\n", "'a' etc).\n", "\n", "We can go the opposite direction, and *decode* the sequence of numbers (bytes)\n", "into a piece of text, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the sequence of numbers (bytes) into text again.\n", "name_again = name_as_utf8_bytes.decode('utf-8')\n", "name_again" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "UTF-8 is a particularly useful encoding, because it defines standard sequences\n", "of bytes that represent an enormous range of characters, including, for\n", "example, Mandarin and Cantonese Chinese characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hello in Mandarin.\n", "mandarin_hello = \"你好\"\n", "hello_as_bytes = mandarin_hello.encode('utf-8')\n", "list(hello_as_bytes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that, this time, UTF-8 used three bytes to represent each of the two\n", "Mandarin characters.\n", "\n", "Another common, but less useful encoding is called [Latin\n", "1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) or ISO-8859-1. This encoding\n", "only defines ways to represent text characters in the standard [Latin\n", "alphabet](https://en.wikipedia.org/wiki/Latin_script). This is the standard\n", "English alphabet plus a range of other characters from other European\n", "languages, including characters with accents.\n", "\n", "For English words using the standard English alphabet, Latin 1 uses the same\n", "set of character-to-byte mappings as UTF-8 does -- 80 for 'P' and so on:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "name_as_latin1_bytes = name.encode('latin1')\n", "list(name_as_latin1_bytes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The differences show up when the encodings generate bytes for characters\n", "outside the standard English alphabet. Here's the surname of [Fernando\n", "Pérez](https://en.wikipedia.org/wiki/Fernando_P%C3%A9rez_(software_developer))\n", "one of the founders of the Jupyter project you are using here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jupyter_person = 'Pérez'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the bytes that UTF-8 needs to store that name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fp_as_utf8 = jupyter_person.encode('utf-8')\n", "list(fp_as_utf8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that UTF-8 still uses 80 for 'P'. The next two bytes --- 195 and 169\n", "--- represent the é in Fernando's name.\n", "\n", "In contrast, Latin 1 uses a single byte --- 233 -- to store the é:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fp_as_latin1 = jupyter_person.encode('latin1')\n", "list(fp_as_latin1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Latin 1 has no idea what to do about Mandarin:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "mandarin_hello.encode('latin1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now consider what will happen if the computer writes (encodes) some text in\n", "Latin 1 format, and then tries to read it (decode) assuming it is in UTF-8\n", "format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "fp_as_latin1.decode('utf-8')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a mess - because UTF-8 doesn't know how to interpret the bytes that Latin\n", "1 wrote --- this sequence of bytes doesn't make sense in the UTF-8 encoding.\n", "\n", "Something similar happens when you write bytes (encode) text with UTF-8 and\n", "then read (decode) assuming the bytes are for Latin 1:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fp_as_utf8.decode('latin1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time there is no error, because the bytes from UTF-8 do mean something to\n", "Latin 1 --- but the text is wrong, because those bytes mean something\n", "*different* in Latin 1 than they do for UTF-8." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fixing encoding errors in Pandas\n", "\n", "With this background, you may have guessed that the problem that we had at the\n", "top of this page was because someone has written a file where the text is in a\n", "different *encoding* than the one that Pandas assumed.\n", "\n", "In fact, Pandas assumes that text is in UTF-8 format, because it is so common.\n", "\n", "In this case, as the filename suggests, the bytes for the text are in Latin\n", "1 encoding. We can tell Pandas about this with the `encoding=` option:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "films = pd.read_csv('imdblet_latin.csv', encoding='latin1')\n", "films.head()" ] } ], "metadata": { "jupytext": { "split_at_heading": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }