{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Infrastructure. Strings, Dates, Collections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Libraries mentioned: `string`, `unicodedata`, `PyICU`, `transliterate`, `base64`, `chardet`, `pycld2`, `python-Levenshtein`, `difflib`, `python-finediff`, `bsdiff4`, `re`, `regex`, `lark-parser`, `datrie`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AB\n", "aaaaaBBB\n", "65 A\n", "Й Й\n" ] } ], "source": [ "print 'ABC'[:-1]\n", "print 'a' * 5 + '\\x42' * 3\n", "print ord('A'), chr(65)\n", "print unichr(0x419), u'\\u0419'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Commonly used string methods:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True False\n", "2\n", "23\n", "-1\n" ] } ], "source": [ "s = 'what was that? antonovka'\n", "print s.startswith('what'), s.endswith('golden')\n", "print s.find('a')\n", "print s.rfind('a')\n", "print s.find('x')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "print s.count('a')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "what was that? antonovka\n", "WHAT WAS THAT? ANTONOVKA\n", "What Was That? Antonovka\n", "What was that? 
antonovka\n" ] } ], "source": [ "print s.lower()\n", "print s.upper()\n", "print s.title()\n", "print s.capitalize()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'abc'\n", "'__abc'\n" ] } ], "source": [ "print repr(' abc '.strip())\n", "print repr('abc'.rjust(5, '_'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['what', 'was', 'that?', 'antonovka']\n", "['wh', 't w', 's th', 't? ', 'ntonovk', '']\n", "what|was|that?|antonovka\n" ] } ], "source": [ "print s.split()\n", "print s.split('a')\n", "print '|'.join(s.split())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "wh[A]t w[A]s th[A]t? [A]ntonovk[A]\n" ] } ], "source": [ "print s.replace('a', '[A]')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python supports formatted output in the style of C's `printf`. 
More on format strings: https://docs.python.org/2/library/stdtypes.html#string-formatting-operations" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hi 00042 3.1416 deadbeef\n" ] } ], "source": [ "from math import pi\n", "print '%s %05d %.4f %x' % ('hi', 42, pi, 3735928559)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "007\n" ] } ], "source": [ "print '%0*d' % (3, 7)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "repr is {'abc': 'def'}\n" ] } ], "source": [ "print 'repr is %r' % {'abc': 'def'}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "42\n", "hello\n" ] } ], "source": [ "var = 'hello'\n", "print '%(var)d' % {'var': 42}\n", "print '%(var)s' % globals()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you find yourself needing the analogous `str.format` method, with its own more powerful format-string syntax, your life took a wrong turn somewhere."
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'***1,234,567,890****'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'{:*^20,}'.format(1234567890)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'this was a triumph'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tpl = string.Template('$subj was a $obj')\n", "tpl.substitute(subj='this', obj='triumph')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Character classes: https://docs.python.org/2/library/string.html#string-constants" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\n", "ABCDEFGHIJKLMNOPQRSTUVWXYZ\n", "0123456789\n", "True False\n" ] } ], "source": [ "print string.ascii_letters\n", "print string.ascii_uppercase\n", "print string.digits\n", "print 'A'.isupper(), '221B'.isdigit()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Й CYRILLIC CAPITAL LETTER SHORT I\n", "Lu\n", "u'\\u0418\\u0306'\n", "CYRILLIC CAPITAL LETTER I + COMBINING BREVE\n" ] } ], "source": [ "import unicodedata\n", "u = u'Й'\n", "print unicodedata.lookup('CYRILLIC CAPITAL LETTER SHORT I'), unicodedata.name(u)\n", "print unicodedata.category(u) # 'L'etter, 'u'ppercase\n", "print repr(unicodedata.normalize('NFD', u))\n", "print ' + '.join([unicodedata.name(ch) for ch in unicodedata.normalize('NFD', u)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NFD is a way to reduce a Unicode string, in which the same visible character can be written as several different sequences of Unicode code points, to a \"common denominator\". NFD tries to split every character into its components (й → и + breve); NFC does the opposite and composes them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For serious work with internationalization and locale conventions, the industry standard is the ICU library (International Components for Unicode), used from Python through its wrapper `PyICU`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "From the VC++2008 command prompt\n", "conda list\n", "Find the icu version number in the list\n", "\n", "set ICU_VERSION=58.2 -- substitute the version number\n", "set PYICU_INCLUDES=%USERPROFILE%/Anaconda2/Library/include\n", "set LIB=%LIB%;%USERPROFILE%/Anaconda2/Library/lib\n", "\n", "pip install PyICU --global-option build_ext --global-option --compiler=msvc\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import icu" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ß I\n", "SS I\n", "SS İ\n" ] } ], "source": [ "s = u'ß i'\n", "print s.upper()\n", "print unicode(icu.UnicodeString(s).toUpper(icu.Locale('de_DE')))\n", "print unicode(icu.UnicodeString(s).toUpper(icu.Locale('tr_TR')))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "жикё\n", "ёжик\n" ] } ], "source": [ "print ''.join(sorted(u'ёжик'))\n", "collator = icu.Collator.createInstance(icu.Locale('ru_RU'))\n", "print ''.join(sorted(u'ёжик', key=collator.getSortKey))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good introduction to the problems that come with Unicode strings can be found here: 
https://github.com/CppCon/CppCon2014/blob/master/Presentations/Unicode%20in%20C%2B%2B/Unicode%20in%20C%2B%2B%20-%20McNellis%20-%20CppCon%202014.pdf" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cent vingt-trois\n" ] } ], "source": [ "print icu.RuleBasedNumberFormat(icu.URBNFRuleSetTag.SPELLOUT, icu.Locale('fr')).format(123)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21 липня 2019 р. о 20:06:56\n" ] } ], "source": [ "from datetime import datetime\n", "formatter = icu.DateFormat.createDateTimeInstance(icu.DateFormat.LONG, icu.DateFormat.kDefault, icu.Locale('uk_UA'))\n", "print formatter.format(datetime.now())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sʺyeshʹ yeshchë etikh myagkikh frantsuzskikh bulok da vypey chayu\n" ] } ], "source": [ "transl = icu.Transliterator.createInstance('Russian-Latin/BGN')\n", "print transl.transliterate('Съешь ещё этих мягких французских булок да выпей чаю')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "578" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(list(icu.Transliterator.getAvailableIDs()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `transliterate` library offers a somewhat more familiar style of transliteration. 
In general, transliterating Cyrillic into Latin script is not a simple task: there are at least 15 competing standards https://ru.wikipedia.org/wiki/Транслитерация_русского_алфавита_латиницей" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install transliterate\n", "```" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S'esh' esche etih mjagkih frantsuzskih bulok da vypej chaju\n" ] } ], "source": [ "import transliterate\n", "print transliterate.translit(u'Съешь ещё этих мягких французских булок да выпей чаю', reversed=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you see the inscription \"бНОПНЯ\" on a cage, do not believe your eyes. The text was written in the CP1251 encoding but displayed as if it were KOI8-R. https://docs.python.org/2.7/library/codecs.html#standard-encodings" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Вопрос\n" ] } ], "source": [ "print (u'бНОПНЯ'.encode('koi8_r').decode('cp1251'))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "d092d0bed0bfd180d0bed181\n", "u'\\u0412\\u043e\\u043f\\u0440\\u043e\\u0441'\n", "'\\xd0\\x92\\xd0\\xbe\\xd0\\xbf\\xd1\\x80\\xd0\\xbe\\xd1\\x81'\n", "c2eeeff0eef1\n" ] } ], "source": [ "print 'Вопрос'.encode('hex')\n", "print repr(u'Вопрос')\n", "print repr(u'Вопрос'.encode('utf-8'))\n", "print u'Вопрос'.encode('cp1251').encode('hex')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target encoding may well lack the characters you need, in which case a `UnicodeEncodeError` is raised. To avoid it, pass an extra parameter that says what to do with a character that cannot be encoded: `ignore`, `replace` or `xmlcharrefreplace`."
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Example: \n", "Example: ????? ???????\n", "Example: &#1575;&#1604;&#1604;&#1594;&#1577; &#1575;&#1604;&#1593;&#1585;&#1576;&#1610;&#1577;\n" ] } ], "source": [ "s = u'Example: اللغة العربية'\n", "print s.encode('cp1251', 'ignore')\n", "print s.encode('cp1251', 'replace')\n", "print s.encode('cp1251', 'xmlcharrefreplace')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'0JLQvtC/0YDQvtGB\\n'\n" ] } ], "source": [ "print repr('Вопрос'.encode('base64'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.encode('base64')` method has the annoying habit of inserting a line break every 76 characters. The built-in `base64` library avoids this; besides ordinary base64 it also implements a URL-safe variant that replaces the characters `+/`, which are unsuitable for file paths and URLs, with `-_`."
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'0JLQvtC/0YDQvtGB'\n", "'0JLQvtC_0YDQvtGB'\n" ] } ], "source": [ "import base64\n", "print repr(base64.b64encode('Вопрос'))\n", "print repr(base64.urlsafe_b64encode('Вопрос'))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'BZh91AY&SYuy^a\\x00\\x00\\x03\\x00j`\\x00\\x10\\x00\\x00\\x01\\xe0\\x00 \\x001\\x0c\\x00\\x94\\x1az\\x88\\x86\\xc9\\xcd\\xf8\\xbb\\x92)\\xc2\\x84\\x83\\xab\\xca\\xf3\\x08'\n", "'x\\x9c\\xbb0\\xe9\\xc2\\xbe\\x0b\\xfb/6\\\\\\xd8w\\xb1\\x11\\x009\\x9c\\x08\\xb1'\n" ] } ], "source": [ "print repr('Вопрос'.encode('bz2'))\n", "print repr('Вопрос'.encode('zlib'))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Вопрос\n" ] } ], "source": [ "print '=D0=92=D0=BE=D0=BF=D1=80=D0=BE=D1=81'.decode('quoted_printable')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xn--ac3c6jbe0jbbcjd.com\n" ] } ], "source": [ "print 'xn--' + 'Вопрос'.encode('punycode') + '.com'" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "смартхаус\n" ] } ], "source": [ "print '80aa8arcefjq'.decode('punycode')" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pnrfne plcure\n" ] } ], "source": [ "print 'caesar cypher'.encode('rot13')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "307\n" ] } ], 
"source": [ "import encodings.aliases\n", "print len(encodings.aliases.aliases)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `chardet` library, which ships with Anaconda, can help detect an encoding." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'\\x8f\\xe0\\xa8\\xa2\\xa5\\xe2, \\xac\\xa8\\xe0'\n" ] } ], "source": [ "print repr(u'Привет, мир'.encode('cp866'))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'confidence': 0.99, 'language': 'Russian', 'encoding': 'IBM866'}\n", "Привет, мир\n" ] } ], "source": [ "import chardet\n", "bytes = '\\x8f\\xe0\\xa8\\xa2\\xa5\\xe2, \\xac\\xa8\\xe0'\n", "cp = chardet.detect(bytes)\n", "print cp\n", "print bytes.decode(cp['encoding'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To detect the language of a text you can use the `pycld2` library, a wrapper around CLD2, the library Chrome uses internally for this purpose. Let us take a very well-known international phrase (from Wikipedia, the first variant)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install pycld2\n", "```" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pycld2" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "phrases = '''\\\n", "Абхазский: Апролетарцәа атәылақәа ӡегьы рҿы иҟоу, шәҽеидышәкыл!\n", "Аварский: Тlолго дунялалъул хlалтlухъаби, цолъе нуж!\n", "Азербайджанский: Bütün ölkələrin proletarları, birləşin!\n", "Албанский: Punetoret e te gjithe vendeve bashkohuni!\n", "Английский: Workers of the world, unite!\n", "Арабский: !يا عمال العالم اتحدوا\n", "Армянский, Восточный: Պրոլետարներ բոլոր երկրների, միացե՜ք։\n", "Армянский, Западный: Բոլոր երկրներու աշխատաւորներ, միացէ՜ք։\n", "Африкаанс: Werkers van alle lande, verenig!\n", "Баскский: Herrialde guztietako proletarioak, elkar zaitezte!\n", "Башкирский: Бөтә илдәрҙең пролетарийҙәре, берләшегеҙ!\n", "Белорусский: Пралетарыі ўсіх краін, яднайцеся!\n", "Бенгальский: দুনিযার মজদুর, এক হও!\n", "Боснийский: Proleteri svih zemalja, ujedinite se!\n", "Болгарский: Пролетарии от всички страни, съединявайте се!\n", "Бурятский: Бүхы оронуудай пролетаринар, нэгэдэгты!\n", "Валлийский: Gweithwyr yr holl wledydd, uno!\n", "Венгерский: Világ proletárjai, egyesüljetek!\n", "Вьетнамский: Vô sản toàn thế giới, đoàn kết lại!\n", "Гаитянский креольский: Travayè nan tout peyi, ini!\n", "Галисийский: Traballadores do mundo, unídevos!\n", "Грузинский: პროლეტარებო ყველა ქვეყნისა, შეერთდით!\n", "Греческий: Προλετάριοι όλων των χωρών, ενωθείτε!\n", "Гуджарати: બધા દેશોમાં કામદાર સંગઠિત!\n", "Датский: Proletarer i alle lande, foren jer!\n", "Иврит: !פועלי כל העולם התאחדו\n", "Идиш: !פראָלעטאריער פון אלע לענדער, פאראייניקט זיך\n", "Индонезийский: Para pekerja di seluruh dunia, bersatulah!\n", "Ирландский: Oibrithe an domhain, aontaigh!\n", "Исландский: Verkamenn allra landa, sameinist!\n", "Испанский: 
¡Trabajadores del mundo, uníos!\n", "Итальянский: Lavoratori di tutto il mondo, unitevi!\n", "Казахский: Барлық елдердің пролетарлары, бірігіңдер!\n", "Калмыцкий: Цуг орн-нутгудын пролетармуд, нэгдцхәтн!\n", "Каннада: ಎಲ್ಲಾ ದೇಶಗಳ ಸಹೋದ್ಯೋಗಿಗಳು, ಯುನೈಟ್!\n", "Карачаево-балкарский: Бютеу дунияны пролетарлары, бирлешигиз!\n", "Карельский: Kaikkien maiden proletaarit, liittykää yhteen!\n", "Каталанский: Proletaris de tots els països, uniu-vos!\n", "Китайский (КНР): 全世界无产者,联合起来!\n", "Китайский (Тайвань): 全世界無產者,聯合起來!\n", "Коми: Став мувывса пролетарийяс, отувтчöй!\n", "Корейский: 만국의 노동자여, 단결하라!\n", "Крымскотатарский: Bütün memleketlerniñ proletarları, birleş!\n", "Курдский: Kirêkaranî/karkerên dinya/cîhanê yekgirin/hevgirin!\n", "Киргизский: Бардык өлкөлөрдүн пролетарлары, бириккиле!\n", "Латынь: Laborantes universis terris iunguntur!\n", "Латышский: Visu zemju proletārieši, savienojieties!\n", "Литовский: Visų šalių proletarai, vienykitės!\n", "Македонский: Пролетери од сите земји, обединете се!\n", "Малагасийский: Mpiasa eran’izao tontolo izao, mampiray!\n", "Малайский: Pekerja semua negara, bersatu!\n", "Мальтийский: Ħaddiema tal-pajjiżi kollha, jingħaqdu!!\n", "Марийский: Чыла элласе пролетарий-влак ушныза\n", "Молдавский: Proletari din toate ţările, uniţi-vă!\n", "Монгольский: Орон бүрийн пролетари нар нэгдэгтүн!\n", "Немецкий: Proletarier aller Länder, vereinigt Euch!\n", "Нидерландский: Proletariërs aller landen, verenigt U!\n", "Норвежский, Букмол: Arbeidere i alle land, foren dere!\n", "Норвежский, Нюнорск: Arbeidarar i alle land, samein dykk!\n", "Окситанский: Proletaris de totes los païses, unissètz-vos!\n", "Осетинский: Ӕппӕт бӕстӕты пролетартӕ, баиу ут!\n", "Персидский: کارگران جهان متحد شوید\n", "Польский: Proletariusze wszystkich krajów, łączcie się!\n", "Португальский: Trabalhadores do mundo, uni-vos!\n", "Румынский: Proletari din toate ţările, uniţi-vă!\n", "Сербский: Пролетери свих земаља, уједините се!\n", "Словацкий: Proletári všetkých krajín, spojte 
sa!\n", "Словенский: Proletarci vse dezel, zdruzite se!\n", "Суахили: Wafanyakazi wa nchi zote, kuungana!\n", "Таджикский: Пролетарҳои ҳамаи мамлакатҳо, як шавед!\n", "Тайский: แรงงานจากทุกประเทศรวมกัน!\n", "Тамильский: அனைத்து நாடுகளின் தொழிலாளர்கள், இணைக்க!\n", "Татарский: Барлык илләрнең пролетарийлары, берләшегез!\n", "Телугу: అన్ని దేశాల వర్కర్స్, ఐక్యం!\n", "Тувинский: Бүгү чурттарның пролетарийлери, каттыжыңар!\n", "Турецкий: Bütün ülkelerin işçileri, birleşin!\n", "Туркменский: Ähli ýurtlaryň proletarlary, birleşiň!\n", "Удмуртский: Вань кунъёсысь пролетарийёс, огазеяське!\n", "Узбекский: Butun dunyo proletarlari, birlashingiz!\n", "Украинский: Пролетарі всіх країн, єднайтеся!\n", "Урду: !،تمام ممالک کے ورکرز متحد\n", "Филиппинский: Mga manggagawa ng mundo, magkaisa!\n", "Финский: Kaikkien maiden proletaarit, liittykää yhteen!\n", "Французский: Prolétaires de tous les pays, unissez-vous!\n", "Хинди: दुनिया के मज़दूरों, एक हों!\n", "Хорватский: Proleteri svih zemalja, ujedinite se!\n", "Чешский: Proletáři všech zemí, spojte se!\n", "Чувашский: Пĕтĕм тĕнчери пролетарисем, пĕрлешĕр!\n", "Шведский: Arbetare i alla länder, förenen eder!\n", "Эсперанто: Proletoj el ĉiuj landoj, unuiĝu!\n", "Эстонский: Kõigi maade proletaarlased, ühinege!\n", "Якутский: Бүтүн дойдулар пролетарийдара, биир буолуҥ!\n", "Японский: 万国の労働者よ、団結せよ!\n", "'''.splitlines()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "93" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(phrases)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Абхазский ('ABKHAZIAN', 'ab', 98, 800.0)\n", "Аварский ('Unknown', 'un', 0, 0.0)\n", "Азербайджанский ('AZERBAIJANI', 'az', 97, 1313.0)\n", "Албанский ('ALBANIAN', 'sq', 97, 1148.0)\n", "Английский ('ENGLISH', 
'en', 96, 1289.0)\n", "Арабский ('ARABIC', 'ar', 97, 640.0)\n", "Армянский, Восточный ('ARMENIAN', 'hy', 100, 1024.0)\n", "Армянский, Западный ('ARMENIAN', 'hy', 100, 1024.0)\n", "Африкаанс ('AFRIKAANS', 'af', 96, 495.0)\n", "Баскский ('BASQUE', 'eu', 98, 1755.0)\n", "Башкирский ('BASHKIR', 'ba', 98, 781.0)\n", "Белорусский ('BELARUSIAN', 'be', 98, 614.0)\n", "Бенгальский ('BENGALI', 'bn', 98, 433.0)\n", "Боснийский ('CROATIAN', 'hr', 97, 625.0)\n", "Болгарский ('BULGARIAN', 'bg', 98, 786.0)\n", "Бурятский ('MONGOLIAN', 'mn', 98, 568.0)\n", "Валлийский ('WELSH', 'cy', 96, 1843.0)\n", "Венгерский ('HUNGARIAN', 'hu', 97, 1505.0)\n", "Вьетнамский ('VIETNAMESE', 'vi', 97, 1472.0)\n", "Гаитянский креольский ('HAITIAN_CREOLE', 'ht', 96, 1251.0)\n", "Галисийский ('GALICIAN', 'gl', 97, 682.0)\n", "Грузинский ('GEORGIAN', 'ka', 100, 1024.0)\n", "Греческий ('GREEK', 'el', 100, 1024.0)\n", "Гуджарати ('GUJARATI', 'gu', 100, 1024.0)\n", "Датский ('Unknown', 'un', 0, 0.0)\n", "Иврит ('HEBREW', 'iw', 97, 972.0)\n", "Идиш ('YIDDISH', 'yi', 98, 1113.0)\n", "Индонезийский ('INDONESIAN', 'id', 97, 1348.0)\n", "Ирландский ('IRISH', 'ga', 96, 1624.0)\n", "Исландский ('ICELANDIC', 'is', 96, 448.0)\n", "Испанский ('GALICIAN', 'gl', 96, 648.0)\n", "Итальянский ('ITALIAN', 'it', 97, 802.0)\n", "Казахский ('KAZAKH', 'kk', 98, 727.0)\n", "Калмыцкий ('MONGOLIAN', 'mn', 98, 406.0)\n", "Каннада ('KANNADA', 'kn', 100, 1024.0)\n", "Карачаево-балкарский ('TURKMEN', 'tk', 98, 341.0)\n", "Карельский ('FINNISH', 'fi', 97, 806.0)\n", "Каталанский ('CATALAN', 'ca', 97, 844.0)\n", "Китайский (КНР) ('Chinese', 'zh', 96, 1984.0)\n", "Китайский (Тайвань) ('ChineseT', 'zh-Hant', 96, 1920.0)\n", "Коми ('Unknown', 'un', 0, 0.0)\n", "Корейский ('Korean', 'ko', 97, 3754.0)\n", "Крымскотатарский ('TURKISH', 'tr', 97, 1024.0)\n", "Курдский ('Unknown', 'un', 0, 0.0)\n", "Киргизский ('KYRGYZ', 'ky', 98, 682.0)\n", "Латынь ('LATIN', 'la', 97, 512.0)\n", "Латышский ('LATVIAN', 'lv', 97, 1448.0)\n", "Литовский 
('LITHUANIAN', 'lt', 97, 1328.0)\n", "Македонский ('MACEDONIAN', 'mk', 98, 1008.0)\n", "Малагасийский ('MALAGASY', 'mg', 97, 1076.0)\n", "Малайский ('INDONESIAN', 'id', 96, 1483.0)\n", "Мальтийский ('MALTESE', 'mt', 97, 1923.0)\n", "Марийский ('Unknown', 'un', 0, 0.0)\n", "Молдавский ('ROMANIAN', 'ro', 97, 998.0)\n", "Монгольский ('MONGOLIAN', 'mn', 98, 657.0)\n", "Немецкий ('GERMAN', 'de', 97, 599.0)\n", "Нидерландский ('DUTCH', 'nl', 97, 646.0)\n", "Норвежский, Букмол ('NORWEGIAN', 'no', 97, 899.0)\n", "Норвежский, Нюнорск ('NORWEGIAN_N', 'nn', 97, 526.0)\n", "Окситанский ('OCCITAN', 'oc', 97, 734.0)\n", "Осетинский ('Unknown', 'un', 0, 0.0)\n", "Персидский ('PERSIAN', 'fa', 97, 755.0)\n", "Польский ('POLISH', 'pl', 97, 1706.0)\n", "Португальский ('PORTUGUESE', 'pt', 96, 660.0)\n", "Румынский ('ROMANIAN', 'ro', 97, 998.0)\n", "Сербский ('SERBIAN', 'sr', 98, 1008.0)\n", "Словацкий ('SLOVAK', 'sk', 97, 1356.0)\n", "Словенский ('SLOVENIAN', 'sl', 97, 496.0)\n", "Суахили ('SWAHILI', 'sw', 97, 1234.0)\n", "Таджикский ('TAJIK', 'tg', 98, 1096.0)\n", "Тайский ('THAI', 'th', 100, 1024.0)\n", "Тамильский ('TAMIL', 'ta', 100, 1024.0)\n", "Татарский ('TATAR', 'tt', 98, 793.0)\n", "Телугу ('TELUGU', 'te', 100, 1024.0)\n", "Тувинский ('Unknown', 'un', 0, 0.0)\n", "Турецкий ('TURKISH', 'tr', 97, 1459.0)\n", "Туркменский ('TURKMEN', 'tk', 97, 1219.0)\n", "Удмуртский ('Unknown', 'un', 0, 0.0)\n", "Узбекский ('UZBEK', 'uz', 97, 835.0)\n", "Украинский ('UKRAINIAN', 'uk', 98, 759.0)\n", "Урду ('URDU', 'ur', 97, 1069.0)\n", "Филиппинский ('TAGALOG', 'tl', 97, 1303.0)\n", "Финский ('FINNISH', 'fi', 97, 806.0)\n", "Французский ('FRENCH', 'fr', 97, 1047.0)\n", "Хинди ('HINDI', 'hi', 98, 587.0)\n", "Хорватский ('CROATIAN', 'hr', 97, 625.0)\n", "Чешский ('CZECH', 'cs', 97, 848.0)\n", "Чувашский ('Unknown', 'un', 0, 0.0)\n", "Шведский ('SWEDISH', 'sv', 97, 673.0)\n", "Эсперанто ('ESPERANTO', 'eo', 97, 1086.0)\n", "Эстонский ('ESTONIAN', 'et', 97, 525.0)\n", "Якутский ('Unknown', 'un', 0, 
0.0)\n", "Японский ('Japanese', 'ja', 97, 2720.0)\n" ] } ], "source": [ "for phrase in phrases:\n", "    lang, text = phrase.split(': ')\n", "    print lang, pycld2.detect(text)[2][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CLD2 failed to identify the following languages (its guess, if any, in parentheses): Avar, Bosnian (Croatian), Buryat (Mongolian), Danish, Spanish (Galician), Kalmyk (Mongolian), Karachay-Balkar (Turkmen), Karelian (Finnish), Komi, Crimean Tatar (Turkish), Kurdish, Malay (Indonesian), Mari, Moldovan (Romanian), Ossetian, Tuvan, Udmurt, Chuvash, Yakut.\n", "\n", "Score: 74/93." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Levenshtein distance (the `python-Levenshtein` library) and generating diffs (deltas) between strings (the built-in `difflib` library and the `python-finediff` library, which usually gives better results)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install python-Levenshtein\n", "```" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "[('delete', 0, 0), ('replace', 2, 1), ('insert', 6, 5)]\n" ] } ], "source": [ "import Levenshtein\n", "print Levenshtein.distance('abcdef', 'badefo')\n", "print Levenshtein.editops('abcdef', 'badefo')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['+ b', '  a', '- b', '- c', '  d', '  e', '  f', '+ o']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import difflib\n", "list(difflib.Differ().compare('abcdef', 'badefo'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install https://github.com/sharpden/python-finediff/tarball/master\n", "```" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "'abcadefo'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import finediff\n", "finediff.FineDiff('abcdef', 'badefo').renderDiffToHTML()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def show_diff(A, B, engine='finediff', format='html'):\n", "    def compact_difflib_diff(A, B):\n", "        import difflib\n", "        prev_type = ''\n", "        s = []\n", "        for typ, _, ch in difflib.Differ().compare(A, B):\n", "            if prev_type != typ and prev_type:\n", "                yield prev_type, ''.join(s)\n", "                s = []\n", "            prev_type = typ\n", "            s.append(ch)\n", "        if s:\n", "            yield prev_type, ''.join(s)\n", "    \n", "    def compact_finediff_diff(A, B):\n", "        import finediff\n", "        fd = finediff.FineDiff(A, B)\n", "        in_offset = 0\n", "        result = []\n", "        for edit in fd.getOps():\n", "            n = edit.getFromLen()\n", "            if isinstance(edit, finediff.FineDiffCopyOp):\n", "                yield ' ', A[in_offset : in_offset + n]\n", "            elif isinstance(edit, finediff.FineDiffDeleteOp):\n", "                yield '-', A[in_offset : in_offset + n]\n", "            elif isinstance(edit, finediff.FineDiffInsertOp):\n", "                yield '+', edit.getText()\n", "            else: # elif isinstance(edit, finediff.FineDiffReplaceOp):\n", "                yield '-', A[in_offset : in_offset + n]\n", "                yield '+', edit.getText()\n", "            in_offset += n\n", "    \n", "    if engine == 'finediff':\n", "        generator = compact_finediff_diff\n", "    elif engine == 'difflib':\n", "        generator = compact_difflib_diff\n", "\n", "    if format == 'html':\n", "        import IPython, cgi\n", "        html = []\n", "        for typ, s in generator(A, B):\n", "            if typ == ' ':\n", "                html.append(cgi.escape(s))\n", "            elif typ == '-':\n", "                # assumed markup: the original tags were lost in extraction\n", "                html.append('<del>' + cgi.escape(s) + '</del>')\n", "            elif typ == '+':\n", "                html.append('<ins>' + cgi.escape(s) + '</ins>')\n", "        html = ''.join(html)\n", "        html = '''\\\n", "        \n", "        <div>%s</div>\n", "        ''' % html\n", "        IPython.display.display(IPython.display.HTML(html))\n", "    elif format == 'text':\n", "        from colorama import Fore, Back, Style\n", "        result = []\n", "        for typ, s in generator(A, B):\n", "            if typ == ' ':\n", "                result.append(s)\n", "            elif typ == '-':\n", "                result.append(Fore.YELLOW + Back.RED + s + Style.RESET_ALL + Fore.BLACK)\n", "            elif typ == '+':\n", "                result.append(Fore.YELLOW + Back.GREEN + s + Style.RESET_ALL + Fore.BLACK)\n", "        print ''.join(result)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ " \n", "
abcadefo
\n", " " ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_diff('abcdef', 'badefo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `bsdiff4` library is geared toward creating binary patches and works like the `bsdiff` utility. For short strings it is not very efficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install bsdiff4\n", "```" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import bsdiff4" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\"BSDIFF40)\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0e\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x06\\x00\\x00\\x00\\x00\\x00\\x00\\x00BZh91AY&SY\\xd4\\x05\\xb3\\x0c\\x00\\x00\\x01@\\x00O\\x00 \\x000\\xcd\\x00\\xc3A\\x81'7\\x17rE8P\\x90\\xd4\\x05\\xb3\\x0cBZh9\\x17rE8P\\x90\\x00\\x00\\x00\\x00BZh91AY&SY\\x1b\\xe2\\r\\x96\\x00\\x00\\x00\\x81\\x007\\x00\\xa0\\x00!\\x80\\x0c\\x01g.\\xe2\\xeeH\\xa7\\n\\x12\\x03|A\\xb2\\xc0\"" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "patch = bsdiff4.diff('abcdef', 'badefo')\n", "patch" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'badefo'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bsdiff4.patch('abcdef', patch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python supports PCRE-like regular expressions through the built-in `re` library. Regular expressions are used to parse or search strings by a pattern whose parts correspond to a character class, a sequence of characters, one of several alternatives, or a repetition of some sub-pattern. A string literal of the form `r'...'` lets you leave backslashes unescaped, except when a backslash is the last character of the string. 
https://docs.python.org/2/library/re.html#regular-expression-syntax" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking that a string matches the pattern of three digits, a hyphen, two digits, a hyphen, and two digits, and extracting those digit groups" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('222', '33', '44')\n", "None\n" ] } ], "source": [ "rx = r'^(\\d{3})-(\\d{2})-(\\d{2})$'\n", "print re.match(rx, '222-33-44').groups()\n", "print re.match(rx, '22-333-44')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finding words of 13 or more letters in *Alice in Wonderland*" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk\n", "alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'conversations',\n", " u'disappointment',\n", " u'Multiplication',\n", " u'inquisitively',\n", " u'uncomfortable',\n", " u'uncomfortable',\n", " u'circumstances',\n", " u'contemptuously',\n", " u'extraordinary',\n", " u'straightening',\n", " u'uncomfortable',\n", " u'contemptuously',\n", " u'extraordinary',\n", " u'uncomfortable',\n", " u'affectionately',\n", " u'uncomfortably']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(r'\\w{13,}', alice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replacing a match with a fixed string, with a string that references captured groups (numbered or named), or with the result of a function applied to the match" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "? km is about ? yards\n", "[2] (km) is about [2187] (yards)\n", "[2] (km) is about [2187] (yards)\n", "4 km is about 4374 yards\n" ] } ], "source": [ "s = '2 km is about 2187 yards'\n", "print re.sub(r'\\d+', '?', s)\n", "print re.sub(r'(\\d+) (\\w+)', r'[\\1] (\\2)', s)\n", "print re.sub(r'(?P<count>\\d+) (?P<dimension>\\w+)', r'[\\g<count>] (\\g<dimension>)', s)\n", "print re.sub(r'\\d+', lambda match: str(int(match.group()) * 2), s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterating over match objects rather than only the matched substrings" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('2 km', '2', 'km')\n", "{'count': '2', 'dimension': 'km'}\n", "(0, 4)\n", "\n", "('2187 yards', '2187', 'yards')\n", "{'count': '2187', 'dimension': 'yards'}\n", "(14, 24)\n", "\n", "('5 kcal', '5', 'kcal')\n", "{'count': '5', 'dimension': 'kcal'}\n", "(26, 32)\n", "\n", "('20920 joules', '20920', 'joules')\n", "{'count': '20920', 'dimension': 'joules'}\n", "(42, 54)\n", "\n" ] } ], "source": [ "s = '2 km is about 2187 yards, 5 kcal is about 20920 joules'\n", "for match in re.finditer(r'(?P<count>\\d+) (?P<dimension>\\w+)', s):\n", "    print (match.group(), match.group('count'), match.group('dimension'))\n", "    print match.groupdict()\n", "    print match.span()\n", "    print" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A convenient online tool for debugging regular expressions is https://regex101.com/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a capturing group is nested inside a repeated group, the standard `re` library keeps only the last repetition. The `regex` library, which is included in Anaconda and can fully replace `re` without rewriting code, avoids this problem. 
The `re.M` (MULTILINE) flag makes `^` and `$` match at the start and end of each line rather than of the whole text" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import regex" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['4.6692', '2.5029']\n", "['3.14', '2.71828', '1.414']\n", "['1.618']\n" ] } ], "source": [ "s = '''\n", "4.6692 2.5029 end\n", "3.14 2.71828 1.414 end\n", "1.618 end\n", "'''\n", "for match in regex.finditer(r'^(((\\d+)\\.(\\d+)) )*end$', s, regex.M):\n", "    print match.captures(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An arbitrary grammar can be described in BNF and parsed with the Earley, LALR(1), or CYK parser from the `lark-parser` library. [Grammar syntax](https://lark-parser.readthedocs.io/en/latest/grammar/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install lark-parser\n", "```" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from lark import Lark, InlineTransformer\n", "\n", "grammar = '''\n", "?sum: product\n", "    | sum \"+\" product -> add\n", "    | sum \"-\" product -> sub\n", "\n", "?product: item\n", "    | product \"*\" item -> mul\n", "    | product \"/\" item -> div\n", "\n", "?item: /[\\d.]+/ -> number\n", "    | \"-\" item -> neg\n", "    | \"(\" sum \")\"\n", "\n", "SPACE.ignore: /\\s+/\n", "'''\n", "\n", "parser = Lark(grammar, start='sum')\n", "\n", "class CalculateTree(InlineTransformer):\n", "    from operator import add, sub, mul, truediv as div, neg\n", "    number = float\n", "\n", "def calc(expr):\n", "    return CalculateTree().transform(parser.parse(expr))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "38.0" ] }, "execution_count": 59, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "calc('3+5*7')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Tree(add, [Tree(number, [Token(ANONRE_0, '3')]), Tree(mul, [Tree(number, [Token(ANONRE_0, '5')]), Tree(number, [Token(ANONRE_0, '7')])])])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parser.parse('3+5*7')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "add\n", "  number\t3\n", "  mul\n", "    number\t5\n", "    number\t7\n", "\n" ] } ], "source": [ "print parser.parse('3+5*7').pretty()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A prefix tree (trie) can be handy for working with strings; it is implemented in the `datrie` library, which ships with Anaconda. A trie resembles a dictionary but can quickly find the keys that are prefixes of a given string, as well as the keys that have a given string as a prefix."
] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from datrie import Trie\n", "import string\n", "\n", "import nltk\n", "alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')\n", "\n", "trie = Trie(string.printable)\n", "for item in set(nltk.tokenize.word_tokenize(alice)):\n", "    trie[item] = True" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'won', u\"won't\", u'wonder', u'wondered', u'wonderful', u'wondering']" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trie.keys(u'won')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 6.72 times longer than the fastest. This could mean that an intermediate result is being cached \n", "100000 loops, best of 3: 6.19 µs per loop\n" ] } ], "source": [ "%%timeit\n", "trie.keys(u'won')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dates and time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Libraries mentioned: `time`, `datetime`, `python-dateutil`, `pytz`, `timelib`, `parsedatetime`."
] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import time" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 618 ms\n" ] } ], "source": [ "%%time\n", "time.sleep(0.618)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from datetime import datetime, timedelta" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "datetime.datetime(2019, 7, 21, 20, 7, 1, 176000)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "now = datetime.now()\n", "now" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A date can be converted to a string using the format string described at https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-07-21T20:07:01.176000\n", "2019 6\n", "2019.07.21 20:07:01\n", "Sunday\n" ] } ], "source": [ "print now.isoformat()\n", "print now.year, now.weekday()\n", "print now.strftime('%Y.%m.%d %H:%M:%S')\n", "print now.strftime('%A')" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-08-10 20:07:01.176000\n" ] } ], "source": [ "print now + timedelta(days=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the value returned by `datetime.now()` carries no information about the time zone it was obtained in. To fix this, use the `python-dateutil` and `pytz` libraries, both included in Anaconda."
] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "datetime.datetime(2019, 7, 21, 20, 7, 1, 202000, tzinfo=tzlocal())\n", "2019-07-21 20:07:01.202000+03:00\n", "3:00:00\n" ] } ], "source": [ "import dateutil, dateutil.tz\n", "now = datetime.now(dateutil.tz.tzlocal())\n", "print repr(now)\n", "print now\n", "print now.utcoffset()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pytz" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-07-21 20:07:01.225000\n", "2019-07-21 20:07:01.225000+03:00\n", "2019-07-21 10:07:01.225000-07:00\n" ] } ], "source": [ "naive_now = datetime.now()\n", "its_in_kiev = pytz.timezone('Europe/Kiev').localize(naive_now)\n", "now_in_la = its_in_kiev.astimezone(pytz.timezone('America/Los_Angeles'))\n", "print naive_now\n", "print its_in_kiev\n", "print now_in_la" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Always prefer UTC for time processing and server configuration, and use local time only for human-readable output.\n", "\n", "Keep in mind that the most unambiguous date format is the Unix timestamp, the number of seconds since January 1, 1970 UTC."
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1563728821.202" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "epoch = datetime.fromtimestamp(0, dateutil.tz.tzutc())\n", "(now - epoch).total_seconds()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1563728821.357" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "time.time()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `timelib` library is a wrapper around roughly half a megabyte of code from PHP's implementation of `strtotime`, which can parse almost any mess that resembles a date. That is much more convenient than `datetime.strptime`, which requires an explicit format string.\n", "\n", "Another library, `python-dateutil` (included in Anaconda) https://dateutil.readthedocs.io/en/stable/examples.html#parse-examples, handles a broader range of tasks and supports time zones, but understands a smaller (though not strictly smaller) set of date formats.\n", "\n", "A third library, `parsedatetime`, is built specifically for parsing dates, but does not always return a meaningful result."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "From the VC++ 2008 command prompt\n", "pip install timelib\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "pip install parsedatetime\n", "```" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import timelib\n", "import dateutil.parser\n", "import parsedatetime\n", "import tabulate, IPython" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dates = '''\\\n", "May 21, 2016\n", "15/05/2016\n", "15.05.2016 00:03:05+4:00\n", "friday\n", "2012/12/30 05:20:21+2\n", "now -1 month\n", "next Easter\n", "+1d\n", "previous Friday\n", "week ago\n", "last Saturday\n", "08 Mar 1908\n", "1917-07-11 21:40\n", "'''.splitlines()" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<table>\n",
"<thead>\n",
"<tr><th>string</th><th>timelib.strtodatetime</th><th>dateutil.parser.parse</th><th>parsedatetime.Calendar().parseDT</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>May 21, 2016</td><td>2016-05-21 00:00:00</td><td>2016-05-21 00:00:00</td><td>2016-05-21 20:07:01</td></tr>\n",
"<tr><td>15/05/2016</td><td>Unexpected character (while parsing date '15/05/2016')</td><td>2016-05-15 00:00:00</td><td>2019-07-21 20:07:01</td></tr>\n",
"<tr><td>15.05.2016 00:03:05+4:00</td><td>2016-05-15 00:03:05</td><td>2016-05-15 00:03:05+04:00</td><td>2019-07-21 00:03:05</td></tr>\n",
"<tr><td>friday</td><td>2019-07-26 00:00:00</td><td>2019-07-26 00:00:00</td><td>2019-07-26 20:07:01</td></tr>\n",
"<tr><td>2012/12/30 05:20:21+2</td><td>2012-12-30 05:20:21</td><td>2012-12-30 05:20:21+02:00</td><td>2012-12-30 05:20:21</td></tr>\n",
"<tr><td>now -1 month</td><td>2019-06-21 17:07:01</td><td>(u'Unknown string format:', 'now -1 month')</td><td>2019-06-21 20:07:01</td></tr>\n",
"<tr><td>next Easter</td><td>The timezone could not be found in the database (while parsing date 'next Easter')</td><td>(u'Unknown string format:', 'next Easter')</td><td>2019-07-21 20:07:01</td></tr>\n",
"<tr><td>+1d</td><td>2019-07-21 17:07:01</td><td>(u'Unknown string format:', '+1d')</td><td>2019-07-22 20:07:01</td></tr>\n",
"<tr><td>previous Friday</td><td>2019-07-19 00:00:00</td><td>(u'Unknown string format:', 'previous Friday')</td><td>2019-07-19 09:00:00</td></tr>\n",
"<tr><td>week ago</td><td>The timezone could not be found in the database (while parsing date 'week ago')</td><td>(u'Unknown string format:', 'week ago')</td><td>2019-07-21 20:07:01</td></tr>\n",
"<tr><td>last Saturday</td><td>2019-07-20 00:00:00</td><td>(u'Unknown string format:', 'last Saturday')</td><td>2019-07-20 09:00:00</td></tr>\n",
"<tr><td>08 Mar 1908</td><td>1908-03-08 00:00:00</td><td>1908-03-08 00:00:00</td><td>1908-03-08 20:07:01</td></tr>\n",
"<tr><td>1917-07-11 21:40</td><td>1917-07-11 21:40:00</td><td>1917-07-11 21:40:00</td><td>1917-07-11 21:40:00</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def parse_date(date, parser):\n", "    try:\n", "        return parser(date)\n", "    except Exception as e:\n", "        return str(e)\n", "\n", "table = [[\n", "    date,\n", "    parse_date(date, timelib.strtodatetime),\n", "    parse_date(date, dateutil.parser.parse),\n", "    parse_date(date, lambda s: parsedatetime.Calendar().parseDT(s)[0]),\n", "] for date in dates]\n", "headers=['string', 'timelib.strtodatetime', 'dateutil.parser.parse', 'parsedatetime.Calendar().parseDT']\n", "IPython.display.display(IPython.display.HTML(tabulate.tabulate(table, headers=headers, tablefmt='html')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential collections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Libraries mentioned: `collections`, `enum34`, `bisect`, `heapq`, `itertools`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `namedtuple` factory function lets you name the elements of a tuple."
] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import namedtuple" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rec(a=3, b=4, c=5)\n", "4 4\n", "3\n", "rec(a=3, b=4, c=11)\n", "3 4 5\n", "[3, 4, 5]\n" ] } ], "source": [ "rec = namedtuple('rec', 'a b c')\n", "r = rec(3, 4, 5)\n", "print r\n", "print r.b, r[1]\n", "print len(r)\n", "print r._replace(c=11)\n", "a, b, c = r\n", "print a, b, c\n", "print list(r)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to give names to numbers is the `enum34` library, a backport from Python 3 that is included in Anaconda" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from enum import IntEnum" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Suit(IntEnum):\n", "    HEARTS = 0\n", "    CLUBS = 1\n", "    DIAMS = 2\n", "    SPADES = 3" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "♥ 0 Suit.HEARTS\n", "♣ 1 Suit.CLUBS\n", "♦ 2 Suit.DIAMS\n", "♠ 3 Suit.SPADES\n" ] } ], "source": [ "for suit in Suit:\n", "    print u'♥♣♦♠'[suit], suit.value, suit" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<Suit.DIAMS: 2>" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Suit(Suit.CLUBS + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Binary search in a sorted array (the counterpart of `std::lower_bound` and `std::upper_bound` in C++) is implemented in the standard `bisect` module. Optional parameters restrict the search to a slice of the array."
] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 9, 9, 9]\n", "4 8\n" ] } ], "source": [ "import bisect\n", "arr = sorted(map(int, list('314159265358979323846')))\n", "print arr\n", "print bisect.bisect_left(arr, 3), bisect.bisect_right(arr, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A binary heap (priority queue, `std::priority_queue`) is implemented in the standard `heapq` module. It is not a separate container but a set of functions that operate on a list while maintaining the heap invariant. To associate a value with a priority, use tuples and their default ordering." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 6, 15, 92, 14]\n", "3\n", "[6, 14, 15, 92]\n", "[6, 14, 15, 92, 535]\n" ] } ], "source": [ "import heapq\n", "h = [3, 14, 15, 92, 6]\n", "heapq.heapify(h)\n", "print h\n", "print heapq.heappop(h)\n", "print h\n", "heapq.heappush(h, 535)\n", "print h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `heapq.merge` function merges several sorted iterables into one." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[1, 2, 2, 3, 3, 3, 4, 4, 5]" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(heapq.merge([1,2,3],[2,3,4],[3,4,5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A deque (the counterpart of `std::deque`) generalizes the stack and the queue: elements can be added and removed at both ends. If only these operations are performed, a deque is significantly faster than a list, which needs $O(N)$ time for operations at its left end."
] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import deque" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "deque([0, 1, 2, 3])\n", "3\n", "deque([0, 1, 2])\n", "0\n", "deque([1, 2])\n" ] } ], "source": [ "d = deque()\n", "d.append(1)\n", "d.append(2)\n", "d.append(3)\n", "d.appendleft(0)\n", "print d\n", "print d.pop()\n", "print d\n", "print d.popleft()\n", "print d" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 4 ms\n" ] } ], "source": [ "%%time\n", "N = 10000\n", "result_deque = []\n", "d = deque()\n", "for i in range(N):\n", "    if i % 3 == 0:\n", "        d.append(i)\n", "    else:\n", "        d.appendleft(i)\n", "for i in range(N):\n", "    if i % 2 == 0:\n", "        result_deque.append(d.pop())\n", "    else:\n", "        result_deque.append(d.popleft())" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 213 ms\n" ] } ], "source": [ "%%time\n", "N = 10000\n", "result_list = []\n", "d = []\n", "for i in range(N):\n", "    if i % 3 == 0:\n", "        d.append(i)\n", "    else:\n", "        d.insert(0, i)\n", "for i in range(N):\n", "    if i % 2 == 0:\n", "        result_list.append(d[-1])\n", "        d = d[:-1]\n", "    else:\n", "        result_list.append(d[0])\n", "        d = d[1:]" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_deque == result_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `itertools` library offers many ways to iterate over collections and other iterables, including generators. 
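A minimal sketch of the generator case, not taken from the original notebook: `squares` is a made-up infinite generator, and the snippet is written to run under both Python 2 and Python 3.

```python
import itertools

def squares():
    # An infinite generator: yields 0, 1, 4, 9, ... forever.
    n = 0
    while True:
        yield n * n
        n += 1

# islice takes a finite slice of the infinite stream lazily.
first_five = list(itertools.islice(squares(), 5))

# takewhile stops at the first element that fails the predicate.
below_thirty = list(itertools.takewhile(lambda x: x < 30, squares()))

print(first_five)
print(below_thirty)
```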
For example, `itertools` can chain several iterables and enumerate all permutations, Cartesian products, and combinations." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import itertools" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1 2 0 1\n" ] } ], "source": [ "for x in itertools.chain(range(3), range(2)):\n", "    print x," ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 'a')\n", "(1, 'b')\n", "(2, 'a')\n", "(2, 'b')\n", "(3, 'a')\n", "(3, 'b')\n" ] } ], "source": [ "for x in itertools.product([1, 2, 3], 'ab'):\n", "    print x" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('0', '0', '0'),\n", " ('0', '0', '1'),\n", " ('0', '1', '0'),\n", " ('0', '1', '1'),\n", " ('1', '0', '0'),\n", " ('1', '0', '1'),\n", " ('1', '1', '0'),\n", " ('1', '1', '1')]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(itertools.product('01', repeat=3))" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 2, 3)\n", "(1, 3, 2)\n", "(2, 1, 3)\n", "(2, 3, 1)\n", "(3, 1, 2)\n", "(3, 2, 1)\n" ] } ], "source": [ "for x in itertools.permutations([1,2,3]):\n", "    print x" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 2)\n", "(1, 3)\n", "(2, 3)\n" ] } ], "source": [ "for x in itertools.combinations([1,2,3], 2):\n", "    print x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`itertools` also includes functions that extend operations such as slicing to arbitrary iterables (which may even be infinite)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('b', 'E'),\n", " ('c', 'F'),\n", " ('d', 'G'),\n", " ('a', 'E'),\n", " ('b', 'F'),\n", " ('c', 'G'),\n", " ('d', 'E'),\n", " ('a', 'F'),\n", " ('b', 'G'),\n", " ('c', 'E')]" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(itertools.izip(\n", "    itertools.islice(itertools.cycle('abcd'), 5, 15),\n", "    itertools.islice(itertools.cycle('EFG'), 0, 10),\n", "))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Associative collections" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Libraries mentioned: `collections`, `sortedcontainers`, `toolz`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sets keep every added element unique" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3 set([1, 2, 3])\n" ] } ], "source": [ "s = set([1,2,2,3,3,3]) # or {1,2,2,3,3,3}\n", "print len(s), s" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True False\n" ] } ], "source": [ "print 2 in s, 4 in s" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{1, 2, 3, 4}" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.add(3)\n", "s.add(4)\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sets support the set-theoretic operations: $A \\cap B$, $A \\cup B$, $A \\setminus B$, $B \\setminus A$, $A \\triangle B$, $\\subset$, $\\subseteq$" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { 
"collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "set(['c', 'd'])\n", "set(['a', 'c', 'b', 'e', 'd', 'g', 'f'])\n", "set(['a', 'b']) set(['e', 'g', 'f'])\n", "set(['a', 'b', 'e', 'g', 'f'])\n", "True False False True True\n" ] } ], "source": [ "a = set('abcd')\n", "b = set('cdefg')\n", "print a & b\n", "print a | b\n", "print a - b, b - a\n", "print a ^ b\n", "print a > set('ab'), a > set('abe'), a > set('abcd'), a >= set('abcd'), a <= set('abcdef')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `set` is mutable, so it cannot serve as a dictionary key or as an element of another set; use `frozenset` for that" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{frozenset(['a', 'c', 'b', 'd']): 4, frozenset([1, 2, 3]): 3}\n", "set([frozenset([1, 2, 3]), frozenset([4, 5, 6])])\n" ] } ], "source": [ "print {frozenset([1,2,3,3]): 3, frozenset(a): 4}\n", "print {frozenset({1,2,2,3}), frozenset({4,5,6})}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dictionary associates an arbitrary object with an immutable (hashable) key."
] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123 [3, 2, 1]\n", "4\n", "3\n", "{(1, 'a'): [3, 2, 1], 5: {}, 'def': 456}\n" ] } ], "source": [ "d = {'abc': 123, 'def': 456, (1, 'a'): [3, 2, 1], 5: {}}\n", "print d['abc'], d[(1, 'a')]\n", "print len(d)\n", "del d['abc']\n", "print len(d)\n", "print d" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'default'" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{'a': 123}.get('b', 'default')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, a dictionary can be created with function-call syntax" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'France': 'Paris', 'Italy': 'Rome', 'Spain': 'Madrid'}" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "capitals = dict(France='Paris', Italy='Rome', Spain='Madrid')\n", "capitals" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'France': 'Paris', 'Italy': 'Rome', 'Spain': 'Madrid'}" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "capitals = dict([('France', 'Paris'), ('Italy', 'Rome'), ('Spain', 'Madrid')])\n", "capitals" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Italy Spain France\n", "['Italy', 'Spain', 'France']\n", "['Rome', 'Madrid', 'Paris']\n", "[('Italy', 'Rome'), ('Spain', 'Madrid'), ('France', 'Paris')]\n" ] } ], "source": [ "for k in capitals:\n", "    print k,\n", "print\n", "print capitals.keys()\n", "print capitals.values()\n", "print 
capitals.items()" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [], "source": [ "capitals['Switzerland'] = 'Bern'" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "'Belgia'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mcapitals\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'Belgia'\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;34m'Brussel'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mKeyError\u001b[0m: 'Belgia'" ] } ], "source": [ "capitals['Belgia'] += 'Brussel'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists can be concatenated with `+` and sets united with `|`, but Python 2 dictionaries, alas, have no such operator."
] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a': 123, 'b': 234, 'common': 2}" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = {'a': 123, 'common': 1}\n", "d.update({'b': 234, 'common': 2})\n", "d" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a': 123, 'b': 234, 'common': 2}" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict({'a': 123, 'common': 1}, b=234, common=2)" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a': 123, 'b': 234, 'common': 2}" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d1 = {'a': 123, 'common': 1}\n", "d2 = {'b': 234, 'common': 2}\n", "dict(d1, **d2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike `std::map` in C++ or `array` in PHP, the Python 2 `dict` is a plain hash table, so it neither preserves insertion order nor keeps its keys sorted. In addition, accessing a missing element of a C++ `std::map` creates a default-constructed object, whereas Python raises `KeyError`. 
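A minimal sketch of the missing-key behaviour using only the built-in `dict`; the keys and values are made up for illustration, and the snippet is written to run under both Python 2 and Python 3.

```python
d = {'b': 2, 'a': 1}

# Indexing a missing key raises KeyError instead of creating a default value.
try:
    d['missing']
    raised = False
except KeyError:
    raised = True

# .get returns a fallback without modifying the dictionary.
fallback = d.get('missing', 0)

# .setdefault inserts the default on first access, which is the closest
# built-in analogue of indexing a std::map.
d.setdefault('c', []).append(3)

# Keys come out in sorted order only if they are sorted explicitly.
ordered_keys = sorted(d)

print(raised)
print(fallback)
print(d['c'])
print(ordered_keys)
```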
Если эти поведения желаемы, надо использовать другие классы словарей" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Коллекция `defaultdict` во всем подобна `dict`, но при обращении к несуществующему ключу конструирует для него значение вызовом функции, переданной в конструктор" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk\n", "alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')\n", "words = nltk.tokenize.word_tokenize(alice)" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'yards', u'yawned', u'yawning', u'ye', u'year', u'years', u'yelled', u'yelp', u'yer', u'yes', u'yesterday', u'yet', u'you', u\"you'd\", u\"you've\", u'young', u'your', u'yours', u'yourself', u'youth']\n" ] } ], "source": [ "cnt = defaultdict(set)\n", "for word in words:\n", " cnt[word[0]].add(word)\n", "print sorted(list(cnt['y']))" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2418 ,\n", "1516 the\n", "1127 '\n", " 974 .\n", " 757 and\n", " 717 to\n", " 612 a\n", " 512 it\n", " 506 she\n", " 496 of\n" ] } ], "source": [ "cnt = defaultdict(int)\n", "for word in words:\n", " cnt[word] += 1\n", "for word, occ in sorted(cnt.items(), key=lambda (word, occ): occ, reverse=True)[:10]:\n", " print '%4d' % occ, word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для конкретно такого поведения удобнее использовать `Counter`" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2418 ,\n", "1516 the\n", "1127 '\n", " 974 
.\n", " 757 and\n", " 717 to\n", " 612 a\n", " 512 it\n", " 506 she\n", " 496 of\n" ] } ], "source": [ "from collections import Counter\n", "\n", "cnt = Counter(words)\n", "for word, occ in sorted(cnt.items(), key=lambda (word, occ): occ, reverse=True)[:10]:\n", "    print '%4d' % occ, word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or even (for this particular task)" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2418 ,\n", "1516 the\n", "1127 '\n", " 974 .\n", " 757 and\n", " 717 to\n", " 612 a\n", " 512 it\n", " 506 she\n", " 496 of\n" ] } ], "source": [ "for word, occ in cnt.most_common(10):\n", "    print '%4d' % occ, word" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add a list of elements to an existing `Counter` with the `.update` method (the counts are added, not replaced)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Counter` can be viewed not only as a dictionary but also as an analogue of `std::multiset`. It supports set-theoretic operations: `|` takes the maximum of the counts, `&` the minimum, and `+` their sum" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'b': 3, 'a': 2, 'c': 2})\n", "Counter({'b': 2, 'a': 1})\n", "Counter({'b': 5, 'a': 3, 'c': 1})\n" ] } ], "source": [ "print Counter('aabbcc') | Counter('abbb')\n", "print Counter('aabbc') & Counter('bbba')\n", "print Counter('aabbc') + Counter('bbba')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, comparison of `Counter` objects is not implemented as in `set` (by inclusion); it is inherited from `dict` and is useless for this purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sortedcontainers` library, which ships with Anaconda, provides the containers `SortedDict`, `SortedSet` and `SortedList`, analogous to `dict`, `set` and `list` respectively, that keep the inserted elements sorted." 
] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sortedcontainers import SortedDict" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "SortedDict(None, 1000, {'France': 'Paris', 'Italy': 'Rome', 'Spain': 'Madrid', 'Switzerland': 'Bern'})" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SortedDict(capitals)" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "SortedSet(['France', 'Italy', 'Spain', 'Switzerland'], key=None, load=1000)" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SortedDict(capitals).keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterating over a large sorted dictionary via `.keys` can take a long time (as the output above shows, it builds a whole `SortedSet`); it is better to use, for example, `itertools.islice(d.iterkeys(), 0, 10)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `OrderedDict` collection preserves the insertion order of keys, and that order is not updated when an existing key is assigned again" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Switzerland Spain Italy France\n", "OrderedDict([('Switzerland', 'Bern'), ('Spain', 'Madrid'), ('Italy', 'Rome'), ('France', 'Paris')])\n" ] } ], "source": [ "from collections import OrderedDict\n", "\n", "od = OrderedDict()\n", "for k, v in sorted(capitals.items(), reverse=True):\n", "    print k,\n", "    od[k] = v\n", "print\n", "print od" ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OrderedDict([('Switzerland', '[Bern]'), ('Spain', '[Madrid]'), ('Italy', '[Rome]'), ('France', '[Paris]')])\n" ] } ], "source": [ 
"for k, v in sorted(capitals.items()):\n", "    od[k] = '[%s]' % v\n", "print od" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `toolz` library (included in Anaconda) offers several handy utility functions" ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import toolz" ] }, { "cell_type": "code", "execution_count": 128, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{3: ['May'],\n", " 4: ['June', 'July'],\n", " 5: ['March', 'April'],\n", " 6: ['August'],\n", " 7: ['January', 'October'],\n", " 8: ['February', 'November', 'December'],\n", " 9: ['September']}" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "toolz.groupby(len, 'January February March April May June July August September October November December'.split())" ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'result'" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = {1: {2: {3: {4: 'result'}}}}\n", "toolz.get_in([1, 2, 3, 4], d, 'default')" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a': 1, 'b': -2, 'c': -3, 'd': 4}" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "toolz.merge({'a': 1, 'b': 2}, {'b': -2, 'c': 3}, {'c': -3, 'd': 4})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other containers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `scipy` library (included in Anaconda) contains, among other things, an implementation of a k-d tree, which allows fast nearest-neighbor searches among points" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive 
namespace from numpy and matplotlib\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: pylab import has clobbered these variables: ['text', 'bytes', 'datetime', 'table', 'var', 'rec', 'pi']\n", "`%matplotlib` prevents importing * from pylab and numpy\n" ] } ], "source": [ "%pylab inline" ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [], "source": [ "N = 1000000\n", "random.seed(0)\n", "xx, yy = random.random((2, N))" ] }, { "cell_type": "code", "execution_count": 133, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wall time: 701 ms\n" ] } ], "source": [ "%%time\n", "import scipy.spatial\n", "index = scipy.spatial.cKDTree(data=c_[xx, yy])" ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Closest point [112510] = (0.500110, 0.499674) at distance 0.000344\n" ] } ], "source": [ "d, idx = index.query([0.5, 0.5])\n", "print 'Closest point [%d] = (%f, %f) at distance %f' % (idx, xx[idx], yy[idx], d)" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAQ8AAAD8CAYAAABpXiE9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnX2QVeWZ4H8PDdG2ABm0FbqBSCaCmi+lGyYrk9A6m9UY\nGEjGyjaZ3UomW1C6oxW1NWknFdlOaiutBpLsmIwL0UmqZgp0VpcImmB2SFNZazXQiI74gcSo0I2R\nfBAhEvl69o97Gm5f7u177rnn433PfX5VXX3vOe97znPuPfc5z9f7vqKqGIZh1MqYrAUwDMNPTHkY\nhhEJUx6GYUTClIdhGJEw5WEYRiRMeRiGEQlTHoZhRMKUh2EYkTDlYRhGJMZmLUAtnHvuuXrBBRdk\nLYZRAwcPHmTChAmxH/fFNw5y9PiJ07aPaxrDRVMmhG5jjGRgYODXqtoSpq1XyuOCCy5g27ZtWYvh\nHOufHuTuTS8xdOAwrZOaue2q2Sy5rC1rsQAYGhqitbW17uOUXuO5Bw6XbSfAtr5PnOxz+8P/xuGj\nx0/ubx7XxNc/9QFnPh/XEJHXwrb1SnkYp1P6Axk8cJjbH/43gNz8QMpdowDlRmW1Tmo++Xr4+l1V\nrL5jysNz7t700ognK8Dho8e5e9NLTvxI1qxZw4oVK+o6RrlrVDhNgTSPa+K2q2aPaLfksjYnPoc8\nYgFTzxmqYL5X2u4jla5FgbZJzUjw39yRdDHLw3NaJzUzWObHVWy++06la2yb1MwTPVdmIFHyuBzH\nGsYsD8+57arZNI9rGrGtnPmeFQsWLKj7GK5fY9wMx3gGDxxGORXHWv/0YNaijcCUh+csuayNr3/q\nA86a752dnXUfw/VrjJvR4lguYW6LQ0Q1VV0OCq5cuZLu7u66j1N6jeufHmR+32anzfqo+BLHMuXh\nCHlNuR46dCj2Y+b1sxrGlziWl27L8FNnZs+jzO/b7JwvWAvD13LTAzu8MFVdwBezPiq+xHi8szzy\n9NQpVwFZimumaq1MnTo19mP6YtZHxZfiNu+Uh+tFUbVQ7lpKcc1UrZXly5dH7lspBuSLWV9KLTEt\nl+NYw3jntuTpqVNNZhdN1VrZsGFDpH6jpSt9MeuL8SX9WgveKY9KTxfXnzrlGE3mvKQjt2/fHqlf\nNQvTt9RtHuM03rktt101u+xISZefOpWodC2u/xDSoJqF6YNZX0yeLOZhvLM8fHzqVCJP1xI3ebIw\nIX/XAyA+LTfZ0dGhNp+HX0SdDMi1uTjqHWvi2vVUQkQGVLUjTNtQloeIXC0iL4nIbhHpKbO/U0R+\nLyI7gr87aujbLSIqIueGkcXwi6GhoUj9XLLK4gh2unQ9cVHV8hCRJmAX8DFgL7AVWKqqzxe16QRu\nVdWFtfQVkenA94CLgHZV/fVospjl4R+9vb11z+eRNfP7NjfMqN5aLI8wAdN5wG5VfSU4+DpgMfD8\nqL3C9f0m8EXgh2GEbUR8GJqdd/IY7IyDMG5LG7Cn6P3eYFspl4vIsyLyIxF5X7W+IrIYGFTVZ2oX\nuzHIY22Aj+Qx2BkHcWVbtgMzVPWDwN8D60drLCJnAX8H3DFau6DtchHZJiLb9u/fH4uwvpCH2oCF\nCxdWb+Q4PhalpUEY5TEITC96Py3YdhJVfUtVDwWvHwPGBQHQSn3/FJgJPCMirwbbt4vIlNKTq+pq\nVe1Q1Y6WllAzwueGPJjL7e3tWYtQN3kMdsZBmJjHVuBCEZlJ4YffBXymuEHwo/+VqqqIzKOglH4D\nHCjXV1V3AucV9X8V6KgWMI0b1+MJvo7hKCYPAVPwrygtDapaHqp6DLgB2AS8ADyoqjtF5DoRuS5o\ndi3wnIg8A/wPoEsLlO2bxIXUig/xBDOXDZcJVZ4euCKPlWy7t+j1PcA9YfuWaXNBGDnixIfRub4M\nzTYaE+/GtsSFL/EE383lWbNmZS2CkRANqzxciSfUGndxPU5Ty
tKlS7MWwUgI7wbGxUW5eIJQiH2k\nNbVhrXGXJOM0SU3tuHbt2liOY0QjySk7G1Z5FKffYOTShWkFT2ut40iq7iNJpbRr1666j2FEI+mk\nQMMqDygokCd6rqRtUvNpiybX86MMq+1rjbskFafJQzGacTpJf68NrTyGifNHWYu2r7XsOakyaV+C\nx0ZtJP29mvIg3h9lLdq+1jqOpOo+khy7kYcCMV9JekyOKQ/i/VHWou1rLXtOqkw6yWK0gYGBuo+R\nB7JYayjpIsOGTdUWE2cxVq0p4FrrOJKo+0iyGG3jxo25GN9SD1mtNZR0kaEpj4C4fpS+TtDsezGa\ny2RZzZzk92rKI2aspNwoJa8BaVMeCWBP8VN0dXVlLULmuFLNHDdeBUwPvH00NwtcNwqtra1Zi5A5\neR0d7ZXyGDxw2Okh9KORRbTdBVatWpW1CJmT18mEvHJbTpTM9O7aEPpKZBVtN9whj66sV5ZHOXwI\nOln5t5FHvFcePgSd8hptD8OcOXOyFsFICK/cljEiI977EnSKGm1Pa+6OJM+zaNGiWI5juIdXlkfb\npGYvg05Rou1pzbGa9HlWr15dt3yNGGj2Aa8sj0lnjfNyeb8ohWNpVSUmfZ59+/ZF7muBZrfxSnn4\nTK3R9rTiJC7HY3yYpLqRMeXhKGlVJSZ9nvHjx0fu67Ji84Uk41lexTwaibSqEpM+T3d3d+S+tkZs\nfdg0hA1KWlWJSZ+nv78/ct+8lnWnRdL1Rea2OExaVYlJnmfLli10dnZG6msjlOsjabfPlIfhNHks\n606LpONZ5rYYRk6xaQgNr1m2bFnWIjQsNg2hkTm+LXFpnCJJt8/cFmNU6k33rVmzJlkBjcww5WGM\nik0nYFTC3JYGoB63w6o8jUqY5ZFz6nU76q3yXLBgQVhRDc8w5ZFz6nU76k33RS0QM9zHlEfOqdft\nqLd8feXKlWFFNTwjVMxDRK4Gvg00Ad9T1b6S/Z3AD4FfBpseVtWvjtZXRO4GFgFHgF8Af6OqB+q9\nIGMkcVQZ1pPuO3ToUKR+hvtUtTxEpAn4DvBx4BJgqYhcUqbpz1T10uDvqyH6/gR4v6p+ENgF3F73\n1Rin4dvgMps5zB/CuC3zgN2q+oqqHgHWAYtDHr9iX1V9XFWPBe2eBKbVJroRhqzXDJk6dWrotmlN\nvWjEQxi3pQ3YU/R+L/BnZdpdLiLPAoPAraq6s4a+nwceCCWxUTNZDi5bvnx56LY2c5hfxBUw3Q7M\nCFyQvwfWh+0oIl8GjgH/XGH/chHZJiLb9u/fH4uwRnps2LAhdFurKfGLMMpjEJhe9H5asO0kqvqW\nqh4KXj8GjBORc6v1FZHPAQuBv1YtWQ7u1LFXq2qHqna0tLSEEDff+BYT2L59e+i2NnOYX4RRHluB\nC0Vkpoi8C+gCHiluICJTRAqLqojIvOC4vxmtb5CF+SLwl6r6dlwX5DPVFEPeYwK+BXcbnaoxD1U9\nJiI3AJsopFvvV9WdInJdsP9e4FrgehE5BhwGugJLomzf4ND3AGcAPwn0zpOqel28l+cPYZYZyHtM\nwGYO8wup4C04SUdHh27bti1rMRJhft/msvUYbZOaT65VM7PnUcp9WwL8su8TichV73D8gwcPMmHC\nhERkM+JHRAZUtSNMW6swdYQwwcK0YwJxuElDQ0OJyGZkjykPRwijGNKOCcQxHH/dunVxi2U4gikP\nRwijGNIu+LLUqTEaNp+HI4QNFqZZ8JXWqnWGn5jycAjXlhm47arZIzJAULubtHDhwiREM2KiNCA+\npnni5LB9TXmUwfcJf+OSP47UaXt7e83nNdKhXHnA2Ikt7w7b35RHCWHqLVwmbvnrtYZ6e3tZsWJF\n5P5GcpQLiCMSOg5qAdMSfJ/w13f5jfSoN/BtyqME3zMMvstvpEe9gW9THiX4PjjLNflnzZqVyXmN\n6pQrD0D1RNj+pjxK8H1wl
mvyL126NJPzGtUpVzd07K39r4Xtb2NbymDZlviOu3btWlMgHlHL2BbL\ntgT4rjCKSaJeJGoWZ9euXbHKYbiDKQ/8T8+mQd6nA8jTwyMtcqc8otwEef9hxEGeszj28IhGrgKm\nUYeQ5/mHERdRszg+FIhZbUw0cqU8ot4ErqU3XSRqFmdgYCBJsWKh0kOi3KBA4xS5Uh5RLQjX0psu\nEnU6gI0bN6YjYB1UekgI5GZ+2CTIVcwj6hBymzszHK6N+o2L266azc0P7DhtikcFi3uNQq6URz1D\nyPP6wzCqs+SyNm56YEfZfRb3qkyu3Jasl1Y0TqerqytrEULRZnGvmsmV5QFmQbhGa2tr1iKEIo6J\njxqNXFkehnusWrUqaxFCYVZr7eTO8jCMqMRltTZKtaopD8OIkUaqVjW3xUiUOXPmZC1CqjRStapZ\nHkZkwpjnixYtyki6bGikoQ5meRiRCDuOaPXq1dkImBGNNNTBlIcRibDm+b59+9IUK3OqDXVY//Qg\n8/s2M7PnUeb3bfa6/N3clgo0SsQ8Ko1kntfCaEMd8hZMNeVRhrx9yUkQdhzR+PHj0xLJGSqlfPM2\nb4y5LWVopIh5VMKORO7u7k5TLKfJm7VmyqMMefuSkyBsRWZ/f38m8rlI3oKp5raUwVaHD0eYiswt\nW7bQ2dmZyPl9i0vlbfxMKMtDRK4WkZdEZLeI9JTZ3ykivxeRHcHfHdX6ishkEfmJiLwc/P+TWoVP\nKnJdySS/4qKW3ETKfSfqlJNZkrfxM1UtDxFpAr4DfAzYC2wVkUdU9fmSpj9T1YU19O0B/lVV+wKl\n0gN8KazgSQY1y0XMr7iohYcGBi2I6gi+Bh+rWWs+WVNh3JZ5wG5VfQVARNYBi4FS5VFr38VAZ9Du\nB0A/NSiPpG+e0i95ft9mL2/WrFm2bFkix81jXCrtLF85RVULYdyWNmBP0fu9wbZSLheRZ0XkRyLy\nvhB9z1fV4QqiN4Dzw4ud/s2Tx5vVZ/IWfIR0s3yV3L4xzRMnhz1GXAHT7cAMVT0kItcA64ELw3ZW\nVRWRsuteishyYDnAtGnT6O3tBeBzzfDIHy8G4C/PfOFk+180zQBg5cqVHDp0CICpU6eyfPlyNmzY\nwPbt20+2veWWWxgaGmLdunUnty1cuJD29vaT54HCYs2tk87norefZUbT709u/8fDHcydcGBE266u\nLlpbW0fMYzFnzhwWLVrE6tWrT1Zcjh8/nu7ubvr7+9myZcvJtsNP6jVr1pzctmDBAjo7O2O/pqVL\nl7J27doRq7qtWLGCgYGBERMXu3hNn//gn/ON7U10jX3q5LbBE5O45qprvb2m8UfeDbTwN82nllR9\n/fjZbD5wYezX9PDRy5itr3NZ86kK4Ef+eDFN4yeHNnGqrlUrIv8O+G+qelXw/nYAVf36KH1eBToo\nKJCyfUXkJaBTVfeJyFSgX1VHtZuK16otNfGgENRMKgCV9vnyQm9vb2Jrt/gUHwjD/L7NZbN8bZOa\neaLnyljPNbPn0dMmfAbY94ObeGffyxLmGGEsj63AhSIyExgEuoDPFDcQkSnArwILYh4Fd+g3wIFR\n+j4CfBboC/7/MIzAw6Q947nNsO4evk85War8SoPykFwqt1I5gh4/diTsMaoqD1U9JiI3AJuAJuB+\nVd0pItcF++8FrgWuF5FjwGGgSwsmTdm+waH7gAdF5L8ArwGfDiv0MGnfPL7frOUI+/SO+pRfsGBB\nEmJ7T7ng6EMDg/xVexs/fXF/4g+oSjUnxw/9NnSuu6rb4hLFbotRP2FdsXpctq399zD99a9xXtN+\n3jzewp4ZX2Fu5w3xX4xnpOmiVKLcA+GTc6YNqGpHmP5WYdrAhE13R02Lb+2/hy3/91Vufc+bAEwZ\n+yZn772Vrf00vAJxIXtXryVtY1samLA3cNQbffrrX+MPxyeM2NY85h2mv/61GqTMJ3lINZv
yaGDC\n3sBRb/TzmvbXtL2RyMP6yKY8GpiwN3DUG/3N4y1MPWOo7PZGJw/jXCzm0cCETT9HTVPvmfEV/vOY\nW0dsO3ziDPbM+ApTYrwOX/E9e2fZFiNR7l/dxzVnftOyLZ4gIpZtMZKh1nqPPfveYcqKXwEwJfgz\n8oEpDyM0NrerUYwFTI3Q2NyuRjGmPIzQRKn3uOWWW5ISx8gYUx5GaKLUewwNnZ6qNfKBKQ8jNFHq\nPYrnqzDyhVcB0wNvH43lOHmbByItbFoCoxivlMcbb/2x7mNYxqA+fC9sMuLDK7fl6PETdR/DMgbp\nsnDhwuqNDC/xSnmMa6pfXBeGQjcS7e3tWYtgJIRXymPKxDPrPkYehkL7RPGky0Z1klrILAm8Uh6T\nzhpX9zHyMBTayCe+rYLnVcA0DnzNGFiGKP/4tgpewykP8C9j4HOGaNasWVmL4A2+xeO8clsaFZ8z\nREuXLs1aBG/wLR5nysMDfHsiFbN27dqsRfAG3+JxDem2+EalBXrifiIlEVcpXiLRGB3f4nGmPDyg\n0gI9cT6RfI6r5Amf4nHmtnhAGpPl+hxXMbLBLI+MCesqJP1ESiquktQi10b2mOWRIS4VBSUV6R8Y\nGKirv+EupjwypF5XIc5S5nKRfqGg0Oo59saNGyPLZIQjq5J2c1sypB5XIe4AZ3Gkf/DAYQQYXpTD\n5eBpo1feZhnoNssjQ+pxFZIIcC65rI0neq6kbVIzpav5uBg8dcnty4osA91eKg+fRh6ORj1FQUkW\njsV57K6urnrFqYhliLItIPTObclTPUKUoqBhM73SOn9xFI7FWZTW2tpatzxQ3j3xufI2LtIqICyH\nd5ZH3p42w67CL/s+wRM9V1ZVHMNmejniKhyLs0x61apVdctTyT05u7n8FA2ujgVJgixL2r2zPBr5\naVNOcQ7TFmOw0LUy6UoPjDPHjaF5XFOilbeuk+V35Z3yyNJMy5pKClKAJ3qujPVcw0Vpw+7CzQ/s\n4O5NL2WiRCpd94G3j/LN/3ipM0ouK7IqaQ+lPETkauDbQBPwPVXtq9BuLvD/gC5V/V/Bti8Ayyjc\n42tU9VvB9kuBe4EzgWPAf1XVn1eTpdI4jysuamF+3+Zc30RpK8444ktz5sypW47RrjvrsSCNnCqu\nGvMQkSbgO8DHgUuApSJySYV2dwKPF217PwXFMQ/4ELBQRN4b7L4L6FXVS4E7gvdVKTfO46/a23ho\nYDD3Kbu0/ds44kuLFi2qWw5Xh6pnmSp2IeMYJmA6D9itqq+o6hFgHbC4TLsbgYeAN4u2XQw8papv\nq+oxYAvwqWCfAhOD12cDodclLA0y/vTF/bkKolYijQFyxcQRX1q9enXdcqR93WHJKnjvSn1LGLel\nDdhT9H4v8GfFDUSkDfgkcAUwt2jXc8B/F5FzgMPANcC2YN9NwCYR+QYFJXZ5lAuAxgqipmmmx+Em\n7du3LxZZsnZPypHVfefKXKdxpWq/BXxJVUesyqSqL3DKlfkxsAMYvurrgZtVdTpwM3BfuQOLyHIR\n2SYi2/bv31/25L5N3+YLrroLrpDVfefKwzKM8hgEphe9nxZsK6YDWCcirwLXAt8VkSUAqnqfqrar\n6keB3wHDU0t9Fng4eP0vFNyj01DV1araoaodLS0tZQW0mzwZ4nAXxo8fn5yAGZPVfefKwzKM27IV\nuFBEZlJQGl3AZ4obqOrM4dci8n1go6quD96fp6pvisgMCvGODwdNh4AFQD9wJfBy1ItwrS4hT9Tr\nLnR3d8cojVtkdd+lMbNcGKoqD1U9JiI3AJsopGrvV9WdInJdsP/eKod4KIh5HAX+VlUPBNuXAd8W\nkbHAH4HlUS8C3PSJDejv76ezszNrMRIji/vOlYelqFYaJeEeHR0dum3btuoNDcCNGoTe3l6bTcwj\nRGRAVTvCtPWuwtQIR54GEBpu4t3AOCMceRtAaLiHKY+
c4ko6b9myZamez0gPUx45xZV0npFfTHnk\nFFdqX9asWZPq+Yz0sIBpTnElnecycWWjsspqZZ1NM+WRY6z2pTJxZaOyymq5kE0zt8VIlAULFmQt\nQlniykZlldVyIZtmysNIFFerS+PKRmWV1XIhm2bKw0iUlStXZi1CWeLKRmWV1XIhm2bKw0iUQ4cO\nZS1CWeLKRmWV1XIhm2YBU6MhiSsblVVWy4VsmikPI1GmTp2atQgViSsblVVWK+tsmrktRqIsX17X\nTAuGw5jyMBJlw4YNWYtgJIQpDyNRtm/fnrUIRkKY8jAMIxJeKY8Dbx/NWgTDMAK8moZw/LTZemiv\nTWbjEwcPHmTChAlZi2GEJLfTEB49fqJ6I8MphoaGmD07ucKlrEeWNjJeuS3jmrwS1wDWrVuX2LFd\nWXaxUfHq1zhl4plZi2A4hAsjSxsZr9yWSWeNy1qERDDTOxoujCxtZLyyPPJI3k3vhQsXJnZsF0aW\nNjKmPDIm76Z3e3t7Ysd2YWRpI9NwymP904PM79vMzJ5Hmd+3OfMnfN5N797e3sSOHcdC3EZ0vIp5\n1EuYeR/Tjj+0TmpmsIyiMNM7HFmPLG1kGsryqOYiZBF/MNPb8JWGUh7VXIQs4g95N71nzZqVtQhG\nQjSU21LJRTi7uZACzir+kGfTe+nSpamez9Le6dFQlsdtV81m3Bg5bfsfjhxj/dODlvpLgLVr16Z2\nrrynvV2joZTHksvaGH/m6cbW0ePK3ZtesvhDAuzatSu1c+U97e0aDeW2QOVh/UMHDjsxqawRnbyn\nvV2j4ZRHtdSoC/EH89ujYWnvdAnltojI1SLykojsFpGeUdrNFZFjInJt0bYviMhzIrJTRG4qaX+j\niLwY7Lsr+mWEx3XXJG9++4oVK2pqX08Rn+vfbd6oanmISBPwHeBjwF5gq4g8oqrPl2l3J/B40bb3\nA8uAecAR4McislFVd4vIFcBi4EOq+o6InBfXRY2G667JaH67KzLWwsDAQOgS9XoXb3b9u02TNKzX\nMG7LPGC3qr4CICLrKPzony9pdyPwEDC3aNvFwFOq+nbQdwvwKeAu4HqgT1XfAVDVN+u4jppwwTWp\nRN789o0bN4ZWHnEoTpe/27SoVwmHJYzb0gbsKXq/N9h2EhFpAz4J/ENJ3+eAj4jIOSJyFnANMD3Y\nNyvY95SIbBGRuRgNnS7Om+LMirSyTnGlar8FfElVR8wTqKovcMqV+TGwAxi+qrHAZODDwG3AgyJy\nWhGGiCwXkW0ism3//v0xiesujey3N7LijJO0lHAY5THIKWsBYFqwrZgOYJ2IvApcC3xXRJYAqOp9\nqtquqh8FfgcMJ/73Ag9rgZ8DJ4BzS0+uqqtVtUNVO1paWmq4ND/JW7l6V1dX6LaNrDjjJC0lHCbm\nsRW4UERmUlAaXcBnihuo6szh1yLyfWCjqq4P3p+nqm+KyAwK8Y4PB03XA1cAPxWRWcC7gF/Xdzn5\nIE9+e2tra+i2FvCMh9uumj0i5gHJKOGqykNVj4nIDcAmoAm4X1V3ish1wf57qxziIRE5BzgK/K2q\nHgi23w/cLyLPUcjEfFZ9WgfCCMWqVatqStfmSXFmRVpKOFSRmKo+BjxWsq2s0lDVz5W8/0iFdkeA\n/xRKSsMwaiINJdxwFaaGERWr/B2JKQ8jUebMmZO1CLGQVu2ET+RGedhTwU0WLVqUtQixkGblbxr3\nchznyMWQ/LyNB8kTq1evzlqEWEirdiKNezmuc+RCedg8DtXJatb4ffv2pXKepEmrdiKNezmuc+RC\neVhZ8+iYZVY/aRWwpXEvx3WOXCgPK2uuzPqnB+l+8JnMLLPx48cnfo40SKvyN417Oa5z5EJ5WFlz\neYYtjuMVau/SsMy6u7sTP0daLLmsjSd6ruSXfZ/giZ4rEwnIp3Evx3WOXCiPvI0HiYtyvm0xaVhm\n/f39iZ8jT6RxL8d
1DvGpIvyMqRdqxxf+p6VhQzKz51EqfbvN45pSUbC9vb01zyZmZIeIDKhqR5i2\n3lkeFuwLTyXLoknELDOjbrxTHmBp2LBU8m1XfvpDuVYcri1mnle8rTB1PQ3rQsWrC0Pcly1bltq5\nwMrI08Rb5eFyGtalG7jRhrjnbQJpl/HSbXE9Det6xWuaZv2aNWsSO3Y5rGAwPbyzPNo8GPTm8g3s\nklWUBLbwU3p4ZXl8oO3sxIpz4sTlilfXraJ6sYLBAmlYl17VeYjIfuC1BA59LjHOnzqmeeLksRNb\n3o3IKeWseuLYW/tfO3H4rd/Wefi6ZH3XlPdWXETlyBu7B6IedxRi/WzDMKZ54uSm8ZPbpGnsu/T4\nsSPHD/12MOTnnrqsdVJW3jrvv3eraqiZxr1SHkkhItvCFsZkjU+ygl/y+iQrZC+vV26LYRjuYMrD\nMIxImPIo4NN0Vz7JCn7J65OskLG8FvMwDCMSZnkYhhGJ3CkPEblaRF4Skd0i0jNKu7kickxEri3a\n9gUReU5EdorITSXtbxSRF4N9d7kqq4hcKiJPisiOYIHweXHIGkZeEekUkd8H594hIndU6ysik0Xk\nJyLycvD/TxyW9e7gHnhWRP63iEyKQ9ak5C3a3y0iKiKnrQVdF6qamz8Ky2H+AngPhbVvnwEuqdBu\nM4VV8K4Ntr0feA44i0Ll7f8B3hvsuyJ4f0bw/jyHZX0c+Hjw+hqgP63PFuiksE5x6L7AXUBP8LoH\nuNNhWf8DMDZ4fWccsiYpb7B/OoWlYl8Dzo3z95Y3y2MesFtVX9HCcpbrgMVl2t0IPAS8WbTtYuAp\nVX1bVY8BWygszA1wPdCnqu8AqGpxP9dkVWBi8PpsYCgGWWuRt9a+i4EfBK9/ACxxVVZVfTz4vAGe\nBKbFIGti8gZ8E/giVJwXKjJ5Ux5twJ6i93uDbScRkTbgk8A/lPR9DviIiJwjImdReGpPD/bNCvY9\nJSJbRGSuw7LeBNwtInuAbwC3xyBrKHkDLg/M+h+JyPtC9D1fVYfXZ3gDON9hWYv5PPCjGGSt5Zw1\nySsii4FBVX0mJjlH4N3AuBj4FvAlVT0hIic3quoLInInBbP/D8AOYHgQyFhgMvBhYC7woIi8RwO7\n0DFZrwduVtWHROTTwH3Av09YzmG2AzNU9ZCIXAOsBy4M21lVVUTSSv9FllVEvgwcA/45QflKqUne\n4KHydxRcrUTIm+UxyKknMBTMytIRQR3AOhF5FbgW+K6ILAFQ1ftUtV1VPwr8DtgV9NkLPKwFfg6c\noDCuwEUK6wnFAAABQ0lEQVRZPws8HLz+FwpmbRxUlVdV31LVQ8Hrx4BxQZButL6/EpGpAMH/OFzC\npGRFRD4HLAT+OsaHRxLy/ikwE3gmuH+mAdtFZEpMMucuYDoWeCX40IaDR+8bpf33CYKQwfvzgv8z\ngBeBScH764CvBq9nUTATxVFZXwA6g9d/AQyk9dkCUzhVOzQPeB2Q0foCdzMyYHqXw7JeDTwPtKR9\n30aRt6T/q8QcME3lR53mHwX/fxeFCPSXg23XAdeVaVv6g/xZcHM8A/xF0fZ3Af9EIdawHbjSYVn/\nHBgItj8FtKf12QI3ADuDcz8JXD5a32D7OcC/Ai9TyBpNdljW3RQeHDuCv3td/mxLjh+78rAKU8Mw\nIpG3mIdhGClhysMwjEiY8jAMIxKmPAzDiIQpD8MwImHKwzCMSJjyMAwjEqY8DMOIxP8HmJHsvGf+\noysAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "gca().set_aspect('equal')\n", "xlim(0.495, 0.505)\n", "ylim(0.495, 0.505)\n", "axvline(x=0.5, ls='--', 
c='gray', lw=1)\n", "axhline(y=0.5, ls='--', c='gray', lw=1)\n", "scatter(xx, yy)\n", "scatter([xx[idx]], [yy[idx]], c='orange');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A k-d tree query is much faster than brute-force search, even a vectorized one: in the timings below a query takes about 33 µs, against 34 ms for the vectorized scan and 1.12 s for the plain Python loop" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "112510 0.0003442443639\n", "Wall time: 1.12 s\n" ] } ], "source": [ "%%time\n", "mindist = 1.0\n", "minidx = -1\n", "for i in xrange(N):\n", "    dist = (xx[i] - 0.5) ** 2 + (yy[i] - 0.5) ** 2\n", "    if dist < mindist:\n", "        mindist = dist\n", "        minidx = i\n", "print minidx, sqrt(mindist)" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "112510 0.0003442443639\n", "Wall time: 34 ms\n" ] } ], "source": [ "%%time\n", "minidx = argmin((xx - 0.5) ** 2 + (yy - 0.5) ** 2)\n", "print minidx, sqrt((xx[minidx] - 0.5) ** 2 + (yy[minidx] - 0.5) ** 2)" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10000 loops, best of 3: 32.7 µs per loop\n" ] } ], "source": [ "%timeit index.query([0.5, 0.5])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }