{ "cells": [ { "cell_type": "code", "execution_count": 18, "id": "5c7787e3", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import unicodedata" ] }, { "cell_type": "markdown", "id": "5f755830", "metadata": {}, "source": [ "# ASCII and Unicode Strings" ] }, { "cell_type": "markdown", "id": "49a860ad", "metadata": {}, "source": [ "Python has two types of \"strings\": byte strings and unicode strings.\n", "\n", "In fact, byte strings are used to represent multiple different kinds of data:\n", "\n", "- ASCII strings\n", "- arbitrary binary arrays of bytes\n", "- UTF-8 encoded unicode (as well as other encodings)\n", "\n", "These distinctions are not marked in the type. \n", "In contrast, Unicode strings are always just strings.\n", "\n", "Syntax to note is:\n", "\n", "- strings are written as `\"...\"`\n", "- unicode strings are written as `u\"...\"`\n", "- non-printable characters can be *escaped* in various ways\n", " - `\\x10` - byte 16 (hexadecimal)\n", " - `\\10` - byte 8 (octal)\n", " - `\\t` `\\n` `\\r` etc. - special characters\n", "- inside Unicode\n", " - `\\uxxxx` 16 bit unicode character\n", " - `\\Uxxxxxxxx` 32 bit unicode character\n" ] }, { "cell_type": "code", "execution_count": 82, "id": "2827ad49", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\\\d10'" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\\10\"" ] }, { "cell_type": "code", "execution_count": 19, "id": "c1b84e2f", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(\"Hello, World\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "36d4d560", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "unicode" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(u\"Hallo, wie gähhhtsch?\")" ] }, { "cell_type": "markdown", "id": "fdcf554e", "metadata": {}, "source": [ "The functions `ord` and `unichr` convert individual characters." ] }, { "cell_type": "code", "execution_count": 2, "id": "19cd3a89", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'M'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unichr(77)" ] }, { "cell_type": "code", "execution_count": 127, "id": "aed45807", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "12356" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord(u\"い\")" ] }, { "cell_type": "markdown", "id": "89480db3", "metadata": {}, "source": [ "We refer to the number (integer) of a Unicode character as its *codepoint*.\n", "\n", "Unicode is really primarily an assignment of codepoints to characters and their properties. \n", "\n", "(The second important part of Unicode is encodings; we look at those below.)" ] }, { "cell_type": "markdown", "id": "2fc52d63", "metadata": {}, "source": [ "The function `unicode` converts a string to a unicode string." ] }, { "cell_type": "code", "execution_count": 128, "id": "f78a1703", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'abc'" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicode(\"abc\")" ] }, { "cell_type": "markdown", "id": "c2c8bb61", "metadata": {}, "source": [ "You can use `str` to convert from Unicode to a string, but it won't work for strings that can't be represented in ASCII. You really need a *codec* (coder-decoder); see below." ] }, { "cell_type": "code", "execution_count": 129, "id": "517019b5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'abc'" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str(u\"abc\")" ] }, { "cell_type": "code", "execution_count": 130, "id": "1a0eab6d", "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'ascii' codec can't encode character u'\\xe4' in position 0: ordinal not in range(128)", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mstr\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mu\"äbc\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\xe4' in position 0: ordinal not in range(128)" ] } ], "source": [ "str(u\"äbc\")" ] }, { "cell_type": "markdown", "id": "e5843241", "metadata": {}, "source": [ "Note that there is a diference between displaying output and printing it; the former uses\n", "the `repr` method, that tries to represent code in an ASCII/source friendly way independent\n", "of encoding." ] }, { "cell_type": "code", "execution_count": 123, "id": "83869bc0", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "u'\\u0200'\n" ] }, { "data": { "text/plain": [ "u'\\u0200'" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print repr(unichr(0x200))\n", "unichr(0x200)" ] }, { "cell_type": "code", "execution_count": 124, "id": "7744ac1b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ȁ\n" ] } ], "source": [ "print unichr(0x200)" ] }, { "cell_type": "markdown", "id": "6479f0aa", "metadata": {}, "source": [ "Let's get an impression of what characters exist in Unicode." ] }, { "cell_type": "code", "execution_count": 17, "id": "f493e60b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 100 ĀćĎĕĜģĪıĸĿņōŔśŢũŰŷžƅƌƓƚơƨƯƶƽDŽNjǒǙǠǧǮǵǼ\n", " 200 ȀȇȎȕȜȣȪȱȸȿɆɍɔɛɢɩɰɷɾʅʌʓʚʡʨʯʶʽ˄ˋ˒˙ˠ˧ˮ˵˼\n", " 300 ̸̜̣̪̱͍͔̀̇̎̿͆͛ͩ̕͢Ͱͷ;΅ΌΓΚΡΨίζντϋϒϙϠϧϮϵϼ\n", " 400 ЀЇЎЕМУЪбипцэєћѢѩѰѷѾ҅ҌғҚҡҨүҶҽӄӋӒәӠӧӮӵӼ\n", " 500 ԀԇԎԕԜԣԪԱԸԿՆՍՔ՛բթհշվօ֌ֶֽ֚֓֡֨֯ׄ׋גינק׮׵׼\n", " 600 ؀؇؎ؕ؜أترظؿنٍٔٛ٢٩ٰٷپڅڌړښڡڨگڶڽۄۋےۙ۠ۧۮ۵ۼ\n", " 700 ܀܇܎ܕܜܣܪܱܸ݆ܿݍݔݛݢݩݰݷݾޅތޓޚޡިޯ޶޽߄ߋߒߙߠߧ߮ߵ߼\n", " 800 ࠀࠇࠎࠕࠜࠣࠪ࠱࠸࠿ࡆࡍࡔ࡛ࡢࡩࡰࡷࡾࢅࢌ࢓࢚ࢡࢨࢯࢶࢽࣄ࣒࣮࣋ࣙ࣠ࣧࣵࣼ\n", " 900 ऀइऎकजणपऱसिॆ्॔ज़ॢ३॰ॷॾঅঌওচডনযশঽৄো৒৙ৠ১৮৵ৼ\n", " a00 ਀ਇ਎ਕਜਣਪ਱ਸਿ੆੍੔ਜ਼੢੩ੰ੷੾અઌઓચડનયશઽૄો૒૙ૠ૧૮૵ૼ\n", " b00 ଀ଇ଎କଜଣପ଱ସି୆୍୔୛ୢ୩୰୷୾அ஌ஓச஡நயஶ஽௄ோ௒௙௠௧௮௵௼\n", " c00 ఀఇఎకజణపఱసిె్౔౛ౢ౩౰౷౾ಅಌಓಚಡನಯಶಽೄೋ೒೙ೠ೧೮೵೼\n", " d00 ഀഇഎകജണപറസിെ്ൔ൛ൢ൩൰൷ൾඅඌඓකඡඨදබලහ෋ිෙ෠෧෮෵෼\n", " e00 ฀งฎตผรสัุ฿ๆํ๔๛๢๩๰๷๾຅ຌຓບມຨຯຶຽໄ໋໒໙໠໧໮໵໼\n", " f00 ༀ༇༎༕༜༣༪༱༸༿ཆཌྷཔཛརཀྵ཰ཷཾ྅ྌྒྷྚྡྨྯྶ྽࿄࿋࿒࿙࿠࿧࿮࿵࿼\n", "1000 ကဇဎပလဣဪေးဿ၆၍ၔၛၢၩၰၷၾႅႌ႓ႚႡႨႯႶႽჄ჋გკრყხჵჼ\n", "1100 ᄀᄇᄎᄕᄜᄣᄪᄱᄸᄿᅆᅍᅔᅛᅢᅩᅰᅷᅾᆅᆌᆓᆚᆡᆨᆯᆶᆽᇄᇋᇒᇙᇠᇧᇮᇵᇼ\n", "1200 ሀሇሎሕሜሣሪሱሸሿቆቍቔቛቢቩተቷቾኅኌናኚኡከኯ኶ኽዄዋዒዙዠዧዮድዼ\n", "1300 ጀጇጎጕጜጣጪጱጸጿፆፍፔ፛።፩፰፷፾ᎅᎌ᎓᎚ᎡᎨᎯᎶᎽᏄᏋᏒᏙᏠᏧᏮᏵᏼ\n", "1400 ᐀ᐇᐎᐕᐜᐣᐪᐱᐸᐿᑆᑍᑔᑛᑢᑩᑰᑷᑾᒅᒌᒓᒚᒡᒨᒯᒶᒽᓄᓋᓒᓙᓠᓧᓮᓵᓼ\n", "1500 ᔀᔇᔎᔕᔜᔣᔪᔱᔸᔿᕆᕍᕔᕛᕢᕩᕰᕷᕾᖅᖌᖓᖚᖡᖨᖯᖶᖽᗄᗋᗒᗙᗠᗧᗮᗵᗼ\n", "1600 ᘀᘇᘎᘕᘜᘣᘪᘱᘸᘿᙆᙍᙔᙛᙢᙩᙰᙷᙾᚅᚌᚓᚚᚡᚨᚯᚶᚽᛄᛋᛒᛙᛠᛧᛮᛵ᛼\n", "1700 ᜀᜇᜎ᜕᜜ᜣᜪᜱ᜸᜿ᝆᝍ᝔᝛ᝢᝩᝰ᝷᝾ចឌនរឡឨឯាួោ់្៙០៧៮៵៼\n", "1800 ᠀᠇᠎᠕᠜ᠣᠪᠱᠸᠿᡆᡍᡔᡛᡢᡩᡰᡷ᡾ᢅᢌᢓᢚᢡᢨ᢯ᢶᢽᣄᣋᣒᣙᣠᣧᣮᣵ᣼\n", "1900 ᤀᤇᤎᤕᤜᤣᤪᤱᤸ᤿᥆᥍ᥔᥛᥢᥩᥰ᥷᥾ᦅᦌᦓᦚᦡᦨ᦯ᦶᦽᧄ᧋᧒᧙᧠᧧᧮᧵᧼\n", "1a00 ᨀᨇᨎᨕ᨜ᨣᨪᨱᨸᨿᩆᩍᩔᩛᩢᩩᩰ᩷᩾᪅᪌᪓᪚᪡᪨᪯᪶᪽᫄᫋᫒᫙᫠᫧᫮᫵᫼\n", "1b00 ᬀᬇᬎᬕᬜᬣᬪᬱᬸᬿᭆ᭍᭔᭛᭢᭩᭰᭷᭾ᮅᮌᮓᮚᮡᮨᮯ᮶ᮽᯄᯋᯒᯙᯠᯧᯮ᯵᯼\n", "1c00 ᰀᰇᰎᰕᰜᰣᰪᰱ᰸᰿᱆ᱍ᱔ᱛᱢᱩᱰᱷ᱾ᲅ᲌ᲓᲚᲡᲨᲯᲶᲽ᳄᳋᳧᳙᳒᳠ᳮᳵ᳼\n", "1d00 ᴀᴇᴎᴕᴜᴣᴪᴱᴸᴿᵆᵍᵔᵛᵢᵩᵰᵷᵾᶅᶌᶓᶚᶡᶨᶯᶶᶽ᷄᷋᷒ᷙᷠᷧᷮ᷵᷼\n", "1e00 ḀḇḎḕḜḣḪḱḸḿṆṍṔṛṢṩṰṷṾẅẌẓẚạẨắẶẽỄịỒộỠủỮỵỼ\n", "1f00 ἀἇἎἕἜἣἪἱἸἿ὆ὍὔὛὢὩὰί὾ᾅᾌᾓᾚᾡᾨᾯᾶ᾽ῄΉῒῙῠῧ΅῵ῼ\n", "2000   ‎―“‣‪‱‸‿⁆⁍⁔⁛⁢⁩⁰⁷⁾₅₌ₓₚ₡₨₯₶₽⃄⃋⃒⃙⃠⃮⃧⃵⃼\n", "2100 ℀ℇℎℕℜ℣Kℱℸℿⅆ⅍⅔⅛ⅢⅩⅰⅷⅾↅ↌↓↚↡↨↯↶↽⇄⇋⇒⇙⇠⇧⇮⇵⇼\n", "2200 ∀∇∎∕∜∣∪∱∸∿≆≍≔≛≢≩≰≷≾⊅⊌⊓⊚⊡⊨⊯⊶⊽⋄⋋⋒⋙⋠⋧⋮⋵⋼\n", "2300 ⌀⌇⌎⌕⌜⌣〉⌱⌸⌿⍆⍍⍔⍛⍢⍩⍰⍷⍾⎅⎌⎓⎚⎡⎨⎯⎶⎽⏄⏋⏒⏙⏠⏧⏮⏵⏼\n", "2400 ␀␇␎␕␜␣␪␱␸␿⑆⑍⑔⑛③⑩⑰⑷⑾⒅⒌⒓⒚⒡⒨⒯ⒶⒽⓄⓋⓒⓙⓠⓧ⓮⓵⓼\n", "2500 ─┇┎┕├┣┪┱┸┿╆╍╔╛╢╩╰╷╾▅▌▓▚□▨▯▶▽◄○◒◙◠◧◮◵◼\n", "2600 ☀☇☎☕☜☣☪☱☸☿♆♍♔♛♢♩♰♷♾⚅⚌⚓⚚⚡⚨⚯⚶⚽⛄⛋⛒⛙⛠⛧⛮⛵⛼\n", "2700 ✀✇✎✕✜✣✪✱✸✿❆❍❔❛❢❩❰❷❾➅➌➓➚➡➨➯➶➽⟄⟋⟒⟙⟠⟧⟮⟵⟼\n", "2800 ⠀⠇⠎⠕⠜⠣⠪⠱⠸⠿⡆⡍⡔⡛⡢⡩⡰⡷⡾⢅⢌⢓⢚⢡⢨⢯⢶⢽⣄⣋⣒⣙⣠⣧⣮⣵⣼\n", "2900 ⤀⤇⤎⤕⤜⤣⤪⤱⤸⤿⥆⥍⥔⥛⥢⥩⥰⥷⥾⦅⦌⦓⦚⦡⦨⦯⦶⦽⧄⧋⧒⧙⧠⧧⧮⧵⧼\n", "2a00 ⨀⨇⨎⨕⨜⨣⨪⨱⨸⨿⩆⩍⩔⩛⩢⩩⩰⩷⩾⪅⪌⪓⪚⪡⪨⪯⪶⪽⫄⫋⫒⫙⫠⫧⫮⫵⫼\n", "2b00 ⬀⬇⬎⬕⬜⬣⬪⬱⬸⬿⭆⭍⭔⭛⭢⭩⭰⭷⭾⮅⮌⮓⮚⮡⮨⮯⮶⮽⯄⯋⯒⯙⯠⯧⯮⯵⯼\n", "2c00 ⰀⰇⰎⰕⰜⰣⰪⰱⰸⰿⱆⱍⱔⱛⱢⱩⱰⱷⱾⲅⲌⲓⲚⲡⲨⲯⲶⲽⳄⳋⳒⳙⳠ⳧ⳮ⳵⳼\n", "2d00 ⴀⴇⴎⴕⴜⴣ⴪ⴱⴸⴿⵆⵍⵔⵛⵢ⵩⵰⵷⵾ⶅⶌⶓ⶚ⶡⶨ⶯ⶶⶽⷄⷋⷒⷙⷠⷧⷮⷵⷼ\n", "2e00 ⸀⸇⸎⸕⸜⸣⸪⸱⸸⸿⹆⹍⹔⹛⹢⹩⹰⹷⹾⺅⺌⺓⺚⺡⺨⺯⺶⺽⻄⻋⻒⻙⻠⻧⻮⻵⻼\n", "2f00 ⼀⼇⼎⼕⼜⼣⼪⼱⼸⼿⽆⽍⽔⽛⽢⽩⽰⽷⽾⾅⾌⾓⾚⾡⾨⾯⾶⾽⿄⿋⿒⿙⿠⿧⿮⿵⿼\n", "3000  〇『〕〜〣〪〱〸〿うきごせぢどばぷまゅれん゚ァエクザソツニヒベムョヮヵー\n", "3100 ㄀ㄇㄎㄕㄜㄣㄪㄱㄸㄿㅆㅍㅔㅛㅢㅩㅰㅷㅾㆅㆌ㆓㆚ㆡㆨㆯㆶㆽ㇄㇋㇒㇙㇠㇧㇮ㇵㇼ\n", "3200 ㈀㈇㈎㈕㈜㈣㈪㈱㈸㈿㉆㉍㉔㉛㉢㉩㉰㉷㉾㊅㊌㊓㊚㊡㊨㊯㊶㊽㋄㋋㋒㋙㋠㋧㋮㋵㋼\n", "3300 ㌀㌇㌎㌕㌜㌣㌪㌱㌸㌿㍆㍍㍔㍛㍢㍩㍰㍷㍾㎅㎌㎓㎚㎡㎨㎯㎶㎽㏄㏋㏒㏙㏠㏧㏮㏵㏼\n", "3400 㐀㐇㐎㐕㐜㐣㐪㐱㐸㐿㑆㑍㑔㑛㑢㑩㑰㑷㑾㒅㒌㒓㒚㒡㒨㒯㒶㒽㓄㓋㓒㓙㓠㓧㓮㓵㓼\n", "3500 㔀㔇㔎㔕㔜㔣㔪㔱㔸㔿㕆㕍㕔㕛㕢㕩㕰㕷㕾㖅㖌㖓㖚㖡㖨㖯㖶㖽㗄㗋㗒㗙㗠㗧㗮㗵㗼\n", "3600 㘀㘇㘎㘕㘜㘣㘪㘱㘸㘿㙆㙍㙔㙛㙢㙩㙰㙷㙾㚅㚌㚓㚚㚡㚨㚯㚶㚽㛄㛋㛒㛙㛠㛧㛮㛵㛼\n", "3700 㜀㜇㜎㜕㜜㜣㜪㜱㜸㜿㝆㝍㝔㝛㝢㝩㝰㝷㝾㞅㞌㞓㞚㞡㞨㞯㞶㞽㟄㟋㟒㟙㟠㟧㟮㟵㟼\n", "3800 㠀㠇㠎㠕㠜㠣㠪㠱㠸㠿㡆㡍㡔㡛㡢㡩㡰㡷㡾㢅㢌㢓㢚㢡㢨㢯㢶㢽㣄㣋㣒㣙㣠㣧㣮㣵㣼\n", "3900 㤀㤇㤎㤕㤜㤣㤪㤱㤸㤿㥆㥍㥔㥛㥢㥩㥰㥷㥾㦅㦌㦓㦚㦡㦨㦯㦶㦽㧄㧋㧒㧙㧠㧧㧮㧵㧼\n", "3a00 㨀㨇㨎㨕㨜㨣㨪㨱㨸㨿㩆㩍㩔㩛㩢㩩㩰㩷㩾㪅㪌㪓㪚㪡㪨㪯㪶㪽㫄㫋㫒㫙㫠㫧㫮㫵㫼\n", "3b00 㬀㬇㬎㬕㬜㬣㬪㬱㬸㬿㭆㭍㭔㭛㭢㭩㭰㭷㭾㮅㮌㮓㮚㮡㮨㮯㮶㮽㯄㯋㯒㯙㯠㯧㯮㯵㯼\n", "3c00 㰀㰇㰎㰕㰜㰣㰪㰱㰸㰿㱆㱍㱔㱛㱢㱩㱰㱷㱾㲅㲌㲓㲚㲡㲨㲯㲶㲽㳄㳋㳒㳙㳠㳧㳮㳵㳼\n", "3d00 㴀㴇㴎㴕㴜㴣㴪㴱㴸㴿㵆㵍㵔㵛㵢㵩㵰㵷㵾㶅㶌㶓㶚㶡㶨㶯㶶㶽㷄㷋㷒㷙㷠㷧㷮㷵㷼\n", "3e00 㸀㸇㸎㸕㸜㸣㸪㸱㸸㸿㹆㹍㹔㹛㹢㹩㹰㹷㹾㺅㺌㺓㺚㺡㺨㺯㺶㺽㻄㻋㻒㻙㻠㻧㻮㻵㻼\n", "3f00 㼀㼇㼎㼕㼜㼣㼪㼱㼸㼿㽆㽍㽔㽛㽢㽩㽰㽷㽾㾅㾌㾓㾚㾡㾨㾯㾶㾽㿄㿋㿒㿙㿠㿧㿮㿵㿼\n" ] } ], "source": [ "for i in range(0x0100,0x4000,256):\n", " print \"%4x\"%i,\"\".join([unichr(j) for j in range(i,i+256,7)])" ] }, { "cell_type": "markdown", "id": "975156d8", "metadata": {}, "source": [ "# Encoding and Decoding" ] }, { "cell_type": "markdown", "id": "98b67da9", "metadata": {}, "source": [ "There are three kind of \"strings\" you have to think about:\n", "\n", "- ASCII - the traditional US-English character set, used for most programming\n", " - 128 code points\n", " - 7 bits per character\n", "- Unicode - the full character set representing all characters in the world\n", " - 110000 characters in 100 scripts\n", "- UTF-8 - an encoding of Unicode\n", " - encoded Unicode with 8 bits per character\n", "\n", "Important property:\n", "\n", "The UTF-8, ASCII, and Unicode are all \"the same\" if all the characters are in the ASCII character set.\n", "\n", "In Python, encoding and decoding is performed via the `encode` and `decode` methods.\n", "They take a *codec* name as an argument (`ascii` or `utf-8` are the only ones that are\n", "relevant to us), plus an optional argument saying what should happen if a string\n", "is not de/encodable." ] }, { "cell_type": "code", "execution_count": 61, "id": "9e2c3bf0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'abc'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u\"abc\".encode(\"ascii\")" ] }, { "cell_type": "code", "execution_count": 66, "id": "a0d41bcd", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'?bc'" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u\"äbc\".encode(\"ascii\",\"replace\")" ] }, { "cell_type": "markdown", "id": "99ec9e43", "metadata": {}, "source": [ "Let's look at a non-ASCII character. As you can see here, the German umlaut \"ä\" turns into a two character sequence when encoded in UTF-8. Each character has the high bit set. You can look up the exact encoding scheme online." ] }, { "cell_type": "code", "execution_count": 67, "id": "6d3ac5d0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\xc3\\xa4bc'" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u\"äbc\".encode(\"utf-8\")" ] }, { "cell_type": "markdown", "id": "d5c76688", "metadata": {}, "source": [ "Some more unusual characters are encoded as four byte sequences in UTF-8." ] }, { "cell_type": "code", "execution_count": 68, "id": "fa666bdd", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\xf0\\x9d\\x8d\\xa2'" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u\"𝍢\".encode(\"utf-8\")" ] }, { "cell_type": "markdown", "id": "e32c576c", "metadata": {}, "source": [ "Note that although four bytes (i.e. 32 bits) are used for encoding this character, its codepoint is only 119650.\n", "That's because only a few bits are used from each 8 bit code." ] }, { "cell_type": "code", "execution_count": 83, "id": "89a79134", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "119650" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord(u\"𝍢\")" ] }, { "cell_type": "markdown", "id": "55f24ebe", "metadata": {}, "source": [ "Of course, we can also decode." ] }, { "cell_type": "code", "execution_count": 86, "id": "da1babd9", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "äbc\n" ] } ], "source": [ "print '\\xc3\\xa4bc'.decode(\"utf-8\")" ] }, { "cell_type": "markdown", "id": "a1973c89", "metadata": {}, "source": [ "You get an error message if the decoding is not possible." ] }, { "cell_type": "code", "execution_count": 87, "id": "a1e65ea5", "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnicodeDecodeError", "evalue": "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m \u001b[1;34m'\\xc3\\xa4bc'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"ascii\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)" ] } ], "source": [ "print '\\xc3\\xa4bc'.decode(\"ascii\")" ] }, { "cell_type": "markdown", "id": "39c3ff4c", "metadata": {}, "source": [ "# File I/O" ] }, { "cell_type": "markdown", "id": "1d739535", "metadata": {}, "source": [ "Standard file descriptors in Python cannot encode/decode UTF-8." ] }, { "cell_type": "code", "execution_count": 73, "id": "0e3a86ed", "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'ascii' codec can't encode character u'\\xe4' in position 1: ordinal not in range(128)", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"temp\"\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m\"w\"\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mstream\u001b[0m\u001b[1;33m:\u001b[0m \u001b[0mstream\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34mu\"Käse und Brot\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\xe4' in position 1: ordinal not in range(128)" ] } ], "source": [ "with open(\"temp\",\"w\") as stream: stream.write(u\"Käse und Brot\")" ] }, { "cell_type": "markdown", "id": "dc5c7d45", "metadata": {}, "source": [ "To read and write Unicode, use the `codecs.open` function.\n", "It returns a standard file object, but it does the right kind of\n", "encoding/decoding for UTF-8." ] }, { "cell_type": "code", "execution_count": 76, "id": "710a274c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import codecs\n", "with codecs.open(\"temp\",\"w\",\"utf-8\") as stream: stream.write(u\"Käse und Brot\")" ] }, { "cell_type": "code", "execution_count": 77, "id": "2159d87b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Käse und Brot\n" ] } ], "source": [ "with codecs.open(\"temp\",\"r\",\"utf-8\") as stream: print stream.read()" ] }, { "cell_type": "code", "execution_count": 78, "id": "c77b8029", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KM-CM-$se und Brot" ] } ], "source": [ "!cat -v temp" ] }, { "cell_type": "markdown", "id": "42e495a5", "metadata": {}, "source": [ "# Unicode Data" ] }, { "cell_type": "markdown", "id": "ed7e8ee2", "metadata": {}, "source": [ "The Unicode consortium defines a lot of information associated with each codepoint.\n", "In Python, you can query this information using the `unicodedata` library." ] }, { "cell_type": "code", "execution_count": null, "id": "d4a8370e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import unicodedata" ] }, { "cell_type": "markdown", "id": "317003c0", "metadata": {}, "source": [ "Information about each codepoint includes:\n", "\n", "- the full name of the character\n", "- the block it is from\n", "- a two letter category (e.g., \"Ll\" for Letter, lower case)\n", "- whether it is a combining character\n", "- the writing direction (BIDI)\n", "- what it decomposes into, if anything\n", "- whether it is a mirror of another character\n", "- the version of Unicode that defines it\n", "- additional information about some characters, like the numerical value of digits" ] }, { "cell_type": "code", "execution_count": 110, "id": "b5316668", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'LATIN SMALL LETTER A WITH DIAERESIS'" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.name(u\"ä\")" ] }, { "cell_type": "code", "execution_count": 115, "id": "0ada0cda", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Ll'" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.category(u\"ä\")" ] }, { "cell_type": "code", "execution_count": 111, "id": "69ed41ce", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "228" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ord(u\"ä\")" ] }, { "cell_type": "code", "execution_count": 112, "id": "692c5bff", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.decimal(u\"3\")" ] }, { "cell_type": "code", "execution_count": 113, "id": "7a900b1d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.numeric(u\"四\")" ] }, { "cell_type": "code", "execution_count": 114, "id": "3c2597aa", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'Ll'" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.category(u\"ß\")" ] }, { "cell_type": "markdown", "id": "c6af9018", "metadata": {}, "source": [ "# Decomposition and Normalization" ] }, { "cell_type": "markdown", "id": "eb59a112", "metadata": {}, "source": [ "Various letters are really just combined forms of separate parts.\n", "\n", "For example, the letter \"ä\" can be viewed as a combination of the letter \"a\" with the diacritic \" ̈\".\n", "\n", "The Unicode consortium hasn't been consistent about how to represent these, so the same letter as it appears on the screen can be represented in two different ways." ] }, { "cell_type": "code", "execution_count": 120, "id": "b0df2824", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ä\n", "ä\n" ] } ], "source": [ "print u'\\u00e4'\n", "print u'a\\u0308'" ] }, { "cell_type": "markdown", "id": "86a1b39d", "metadata": {}, "source": [ "Although these strings look the same, they are represented differently." ] }, { "cell_type": "code", "execution_count": 121, "id": "5e0cd869", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u'\\u00e4'==u'a\\u0308'" ] }, { "cell_type": "markdown", "id": "9c03f299", "metadata": {}, "source": [ "Unicodedata can decompose characters." ] }, { "cell_type": "code", "execution_count": 30, "id": "419dd7a1", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'0061 0308'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.decomposition(u\"ä\")" ] }, { "cell_type": "markdown", "id": "9ba1788d", "metadata": {}, "source": [ "More generally, it can normalize a string into one of four forms:\n", "\n", "- NFD - decomposed by canonical equivalence, combining characters arranged in specific order\n", "- NFC - decomposed and then recomposed by canonical equivalence\n", "- NFKD - decomposed by compatibility, combining characters arranged in specific order\n", "- NFKC - decomposed by compatibility, then recomposed by canonical equivalence\n", "\n", "What does this mean?\n", "\n", "- canonical equivalence: same appearance and same meaning when printed\n", "- compatible: distinct appearance but usually the same meaning\n", "\n", "For example, \"ff\" as a ligature is compatible with the two letters \"ff\", but not canonically equivalent (since they look different).\n", "\n", "- NFC - best compatibility with conversions from legacy encodings\n", "- NFKC - preferred for identifiers, best security\n", "- NFD - easier to process\n", "- NFKD - easier to process\n", "\n", "Yes, unfortunately, you do need to worry about this." ] }, { "cell_type": "code", "execution_count": 37, "id": "311cdda4", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NFC u'\\xe4' ä\n", "NFKC u'\\xe4' ä\n", "NFD u'a\\u0308' ä\n", "NFKD u'a\\u0308' ä\n" ] } ], "source": [ "for n in [\"NFC\",\"NFKC\",\"NFD\",\"NFKD\"]:\n", " s = unicodedata.normalize(n,u\"ä\")\n", " print n,repr(s),s" ] }, { "cell_type": "code", "execution_count": 106, "id": "61fa743d", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "r̈\n" ] } ], "source": [ "print u\"r\\u0308\"" ] }, { "cell_type": "code", "execution_count": 107, "id": "2ffa3f2b", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+̈\n" ] } ], "source": [ "print u\"+\\u0308\"" ] }, { "cell_type": "code", "execution_count": 108, "id": "f219ffc8", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "̈\n" ] } ], "source": [ "print u\"\\u0308\"" ] }, { "cell_type": "code", "execution_count": 109, "id": "226e0102", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ̈\n" ] } ], "source": [ "print u\" \\u0308\"" ] }, { "cell_type": "markdown", "id": "23fd72ba", "metadata": {}, "source": [ "# Ligatures" ] }, { "cell_type": "markdown", "id": "6c7dbe25", "metadata": {}, "source": [ "Many languages have ligatures. In some languages and scripts (e.g., German), ligatures like \"ä\" and \"ß\" have become letters in their own right. In other scripts, ligatures are just different presentations depending on the context a character appears in; in those cases, Unicode does not represent ligatures as separate code points." ] }, { "cell_type": "markdown", "id": "7d1977ee", "metadata": {}, "source": [ "Here is an example in Arabic. Note how the string looks very different when printed a character at a time vs. when printed as a word (also note that Arabic is a right-to-left language):" ] }, { "cell_type": "code", "execution_count": 91, "id": "7a3df6a1", "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = u\"كتاب\"" ] }, { "cell_type": "code", "execution_count": 92, "id": "151b9d5c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "كتاب\n" ] } ], "source": [ "print s" ] }, { "cell_type": "code", "execution_count": 95, "id": "02917b08", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ك ت ا ب\n" ] } ], "source": [ "for c in s: print c,\n", "print" ] }, { "cell_type": "markdown", "id": "e2cbfedc", "metadata": {}, "source": [ "In contrast, the \"ffi\" ligature in English has its own Unicode codepoint." ] }, { "cell_type": "code", "execution_count": 96, "id": "25d37ce7", "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = u\"a\\ufb03ne\"" ] }, { "cell_type": "code", "execution_count": 97, "id": "4145e9c2", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "affine\n" ] } ], "source": [ "print s" ] }, { "cell_type": "code", "execution_count": 98, "id": "fb2bcfeb", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a ffi n e\n" ] } ], "source": [ "for c in s: print c,\n", "print" ] }, { "cell_type": "code", "execution_count": 102, "id": "f070f863", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s==u\"affine\"" ] }, { "cell_type": "code", "execution_count": 103, "id": "98e51a5a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'affine'" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.normalize(\"NFKD\",s)" ] }, { "cell_type": "code", "execution_count": 122, "id": "94d71588", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicodedata.normalize(\"NFKD\",s)==unicodedata.normalize(\"NFKD\",u\"affine\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1e9789b5", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }