{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Справочник" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Токенизатор" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Токенизатор в Yargy реализован на регулярных выражениях. Для каждого типа токена есть правило с регуляркой:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[TokenRule(\n", " type='RU',\n", " pattern='[а-яё]+'\n", " ),\n", " TokenRule(\n", " type='LATIN',\n", " pattern='[a-z]+'\n", " ),\n", " TokenRule(\n", " type='INT',\n", " pattern='\\\\d+'\n", " ),\n", " TokenRule(\n", " type='PUNCT',\n", " pattern='[-\\\\\\\\/!#$%&()\\\\[\\\\]\\\\*\\\\+,\\\\.:;<=>?@^_`{|}~№…\"\\\\\\'«»„“ʼʻ”]'\n", " ),\n", " TokenRule(\n", " type='EOL',\n", " pattern='[\\\\n\\\\r]+'\n", " ),\n", " TokenRule(\n", " type='OTHER',\n", " pattern='\\\\S'\n", " )]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import RULES\n", "\n", "\n", "RULES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Токенизатор инициализируется списком правил. По-умолчанию — это `RULES`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='a',\n", " span=[0, 1),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='@',\n", " span=[1, 2),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='mail',\n", " span=[2, 6),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='.',\n", " span=[6, 7),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='ru',\n", " span=[7, 9),\n", " type='LATIN'\n", " )]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import Tokenizer\n", "\n", "\n", "text = 'a@mail.ru'\n", "tokenizer = Tokenizer()\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Пользователь может убрать часть правил из списка или добавить новые. Уберём токены с переводами строк:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='\\n',\n", " span=[0, 1),\n", " type='EOL'\n", " ),\n", " Token(\n", " value='abc',\n", " span=[1, 4),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='\\n',\n", " span=[4, 5),\n", " type='EOL'\n", " ),\n", " Token(\n", " value='123',\n", " span=[5, 8),\n", " type='INT'\n", " ),\n", " Token(\n", " value='\\n',\n", " span=[8, 9),\n", " type='EOL'\n", " )]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = Tokenizer()\n", "\n", "text = '''\n", "abc\n", "123\n", "'''\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Для этого удалим правило `EOL`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='abc',\n", " span=[1, 4),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='123',\n", " span=[5, 8),\n", " type='INT'\n", " )]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = Tokenizer().remove_types('EOL')\n", "\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В Yargy есть примитивные правила для токенизации емейлов и телефонов. 
 { "cell_type": "markdown", "metadata": {}, "source": [ "Yargy also ships with primitive rules for tokenizing emails and phone numbers. They are disabled by default:" ] },
 { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='email',\n", " span=[0, 5),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value=':',\n", " span=[5, 6),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='ab@mail.ru',\n", " span=[7, 17),\n", " type='EMAIL'\n", " ),\n", " Token(\n", " value='call',\n", " span=[18, 22),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value=':',\n", " span=[22, 23),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value=' 8 915 132 54 76',\n", " span=[23, 39),\n", " type='PHONE'\n", " )]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import EMAIL_RULE, PHONE_RULE\n", "\n", "\n", "text = 'email: ab@mail.ru call: 8 915 132 54 76'\n", "tokenizer = Tokenizer().add_rules(EMAIL_RULE, PHONE_RULE)\n", "list(tokenizer(text))" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add our own rule for extracting domain names:" ] },
 { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='на',\n", " span=[0, 2),\n", " type='RU'\n", " ),\n", " Token(\n", " value='сайте',\n", " span=[3, 8),\n", " type='RU'\n", " ),\n", " Token(\n", " value='www.VKontakte.ru',\n", " span=[9, 25),\n", " type='DOMAIN'\n", " )]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import TokenRule\n", "\n", "\n", "DOMAIN_RULE = TokenRule('DOMAIN', r'[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+')\n", "\n", "\n", "text = 'на сайте www.VKontakte.ru'\n", "tokenizer = Tokenizer().add_rules(DOMAIN_RULE)\n", "list(tokenizer(text))" ] },
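 { "cell_type": "markdown", "metadata": {}, "source": [ "`remove_types` and `add_rules` each return a tokenizer, so the calls can be combined; a small sketch (the chaining itself is an assumption, not shown above):" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# drop newline tokens and recognize domains with one tokenizer;\n", "# assumes the two calls chain, since each returns a tokenizer\n", "tokenizer = Tokenizer().remove_types('EOL').add_rules(DOMAIN_RULE)\n", "\n", "list(tokenizer('сайт example.com'))" ] },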
 { "cell_type": "markdown", "metadata": {}, "source": [ "By default Yargy uses `MorphTokenizer`, not `Tokenizer`. For every token of type `'RU'` it runs Pymorphy2 and adds a `forms` field with the morphology:" ] },
 { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='X',\n", " span=[0, 1),\n", " type='LATIN'\n", " ),\n", " MorphToken(\n", " value='век',\n", " span=[2, 5),\n", " type='RU',\n", " forms=[Form('век', Grams(NOUN,inan,masc,nomn,sing)),\n", " Form('век', Grams(NOUN,accs,inan,masc,sing)),\n", " Form('век', Grams(ADVB)),\n", " Form('веко', Grams(NOUN,gent,inan,neut,plur))]\n", " ),\n", " MorphToken(\n", " value='стал',\n", " span=[6, 10),\n", " type='RU',\n", " forms=[Form('стать', Grams(VERB,indc,intr,masc,past,perf,sing))]\n", " )]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import MorphTokenizer\n", "\n", "tokenizer = MorphTokenizer()\n", "list(tokenizer('X век стал'))" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Gazetteer" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "A dictionary of job titles or geographic names can be written with the standard tools `rule`, `or_`, `normalized`, `caseless`:" ] },
 { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from yargy import rule, or_\n", "from yargy.predicates import normalized, caseless\n", "\n", "\n", "POSITION = or_(\n", "    rule(normalized('генеральный'), normalized('директор')),\n", "    rule(normalized('бухгалтер'))\n", ")\n", "\n", "GEO = or_(\n", "    rule(normalized('Ростов'), '-', caseless('на'), '-', caseless('Дону')),\n", "    rule(normalized('Москва'))\n", ")" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "This is inconvenient and error-prone. Yargy uses pipelines to build dictionaries. Two types of gazetteers are implemented: `morph_pipeline` and `caseless_pipeline`. `morph_pipeline` normalizes words before matching:" ] },
 { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['электронным', 'дневником']\n", "['электронные', 'дневники']\n", "['электронное', 'дневнику']\n" ] } ], "source": [ "from yargy import Parser\n", "from yargy.pipelines import morph_pipeline\n", "\n", "\n", "TYPE = morph_pipeline(['электронный дневник'])\n", "\n", "parser = Parser(TYPE)\n", "text = 'электронным дневником, электронные дневники, электронное дневнику'\n", "for match in parser.findall(text):\n", "    print([_.value for _ in match.tokens])" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`caseless_pipeline` searches for words without normalization. For example, let's find the Arabic names \"Абд Аль-Азиз Бин Мухаммад\" and \"Абд ар-Рахман Наср ас-Са ди\" in a text:" ] },
 { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Абд', 'Аль', '-', 'Азиз', 'Бин', 'Мухаммад']\n", "['АБД', 'АР', '-', 'РАХМАН', 'НАСР', 'АС', '-', 'СА', 'ДИ']\n" ] } ], "source": [ "from yargy.pipelines import caseless_pipeline\n", "\n", "\n", "NAME = caseless_pipeline([\n", "    'Абд Аль-Азиз Бин Мухаммад',\n", "    'Абд ар-Рахман Наср ас-Са ди'\n", "])\n", "\n", "parser = Parser(NAME)\n", "text = 'Абд Аль-Азиз Бин Мухаммад, АБД АР-РАХМАН НАСР АС-СА ДИ'\n", "for match in parser.findall(text):\n", "    print([_.value for _ in match.tokens])" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicates" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Below are the built-in predicates. The usage examples come from the library docstrings; `tokenize` and `activate` in them are helpers from the doctest setup.\n\n", "`eq`: `a == b`\n\n    >>> predicate = eq('1')\n    >>> token, = tokenize('1')\n    >>> predicate(token)\n    True\n\n", "`caseless`: `a.lower() == b.lower()`\n\n    >>> predicate = caseless('Рано')\n    >>> token, = tokenize('РАНО')\n    >>> predicate(token)\n    True\n\n", "`in_`: `a in b`\n\n    >>> predicate = in_({'S', 'M', 'L'})\n    >>> a, b = tokenize('S 1')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    False\n\n", "`in_caseless`: `a.lower() in b`\n\n    >>> predicate = in_caseless({'S', 'M', 'L'})\n    >>> a, b = tokenize('S m')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    True\n\n", "`gte`: `a >= b`\n\n    >>> predicate = gte(4)\n    >>> a, b, c = tokenize('3 5 C')\n    >>> predicate(a)\n    False\n    >>> predicate(b)\n    True\n    >>> predicate(c)\n    False\n\n", "`lte`: `a <= b`\n\n    >>> predicate = lte(4)\n    >>> a, b, c = tokenize('3 5 C')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    False\n    >>> predicate(c)\n    False\n\n", "`length_eq`: `len(a) == b`\n\n    >>> predicate = length_eq(3)\n    >>> a, b = tokenize('XXX 123')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    True\n\n", "`normalized`: the word's normal form equals value\n\n    >>> a = activate(normalized('сталь'))\n    >>> b = activate(normalized('стать'))\n    >>> token, = tokenize('стали')\n    >>> a(token)\n    True\n    >>> b(token)\n    True\n\n", "`dictionary`: the word's normal form is in value\n\n    >>> predicate = activate(dictionary({'учитель', 'врач'}))\n    >>> a, b = tokenize('учителя врачи')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    True\n\n", "`gram`: value is among the word's grammemes\n\n    >>> a = activate(gram('NOUN'))\n    >>> b = activate(gram('VERB'))\n    >>> token, = tokenize('стали')\n    >>> a(token)\n    True\n    >>> b(token)\n    True\n\n", "`type`: the token type equals value\n\n    >>> predicate = activate(type('INT'))\n    >>> a, b = tokenize('3 раза')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    False\n\n", "`tag`: the token tag equals value\n\n", "`custom`: an arbitrary function as a predicate\n\n    >>> from math import log\n    >>> f = lambda x: int(log(int(x), 10)) == 2\n    >>> predicate = activate(custom(f, types=INT))\n    >>> a, b = tokenize('12 123')\n    >>> predicate(a)\n    False\n    >>> predicate(b)\n    True\n\n", "`true`: always returns True\n\n    >>> predicate = true()\n    >>> predicate(False)\n    True\n\n", "`is_lower`: `str.islower`\n\n    >>> predicate = is_lower()\n    >>> a, b = tokenize('xxx Xxx')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    False\n\n", "`is_upper`: `str.isupper`\n\n    >>> predicate = is_upper()\n    >>> a, b = tokenize('XXX xxx')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    False\n\n", "`is_title`: `str.istitle`\n\n    >>> predicate = is_title()\n    >>> a, b = tokenize('XXX Xxx')\n    >>> predicate(a)\n    False\n    >>> predicate(b)\n    True\n\n", "`is_capitalized`: the word starts with an uppercase letter\n\n    >>> predicate = is_capitalized()\n    >>> a, b, c = tokenize('Xxx XXX xxX')\n    >>> predicate(a)\n    True\n    >>> predicate(b)\n    True\n    >>> predicate(c)\n    False\n\n", "`is_single`: the word is in the singular\n\n    >>> predicate = is_single()\n    >>> token, = tokenize('слово')\n    >>> predicate(token)\n    True" ] },
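 { "cell_type": "markdown", "metadata": {}, "source": [ "Predicates are combined with `and_`, `or_`, `not_` and used inside rules. A small sketch that finds capitalized nouns (the exact matches depend on the Pymorphy2 dictionary):" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yargy import Parser, rule, and_\n", "from yargy.predicates import gram, is_capitalized\n", "\n", "\n", "# a capitalized noun, for example the start of a proper name\n", "NOUN = rule(\n", "    and_(\n", "        gram('NOUN'),\n", "        is_capitalized()\n", "    )\n", ")\n", "\n", "parser = Parser(NOUN)\n", "for match in parser.findall('Москва слезам не верит'):\n", "    print([_.value for _ in match.tokens])" ] },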
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Interpretation" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "The object produced by interpretation is described by the `fact` constructor. `attribute` sets a field's default value. For example, in `Date` the year defaults to 2017:" ] },
 { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date(\n", " year='2016',\n", " month='июля',\n", " day='18'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Date(\n", " year=2017,\n", " month='марта',\n", " day='15'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from IPython.display import display\n", "\n", "from yargy import Parser, rule, and_, or_\n", "from yargy.interpretation import fact, attribute\n", "from yargy.predicates import dictionary, gte, lte\n", "\n", "\n", "Date = fact(\n", "    'Date',\n", "    [attribute('year', 2017), 'month', 'day']\n", ")\n", "\n", "\n", "MONTHS = {\n", "    'январь',\n", "    'февраль',\n", "    'март',\n", "    'апрель',\n", "    'май',\n", "    'июнь',\n", "    'июль',\n", "    'август',\n", "    'сентябрь',\n", "    'октябрь',\n", "    'ноябрь',\n", "    'декабрь'\n", "}\n", "\n", "\n", "MONTH_NAME = dictionary(MONTHS)\n", "DAY = and_(\n", "    gte(1),\n", "    lte(31)\n", ")\n", "YEAR = and_(\n", "    gte(1900),\n", "    lte(2100)\n", ")\n", "DATE = rule(\n", "    DAY.interpretation(\n", "        Date.day\n", "    ),\n", "    MONTH_NAME.interpretation(\n", "        Date.month\n", "    ),\n", "    YEAR.interpretation(\n", "        Date.year\n", "    ).optional()\n", ").interpretation(\n", "    Date\n", ")\n", "\n", "\n", "text = '''18 июля 2016\n", "15 марта\n", "'''\n", "parser = Parser(DATE)\n", "for line in text.splitlines():\n", "    match = parser.match(line)\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "For dates the parse trees look simple: a constructor vertex with a few attribute children:" ] },
 { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parser = Parser(DATE)\n", "for line in text.splitlines():\n", "    match = parser.match(line)\n", "    display(match.tree.as_dot)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "How does the interpretation algorithm behave when a constructor's child is another constructor rather than an attribute? When an attribute's child is another attribute? When a constructor or an attribute has several token vertices underneath instead of one? Let's go from simple to complex. When an attribute vertex has several tokens underneath, they are joined:" ] },
 { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.predicates import eq, type, dictionary\n", "\n", "\n", "Money = fact(\n", "    'Money',\n", "    ['value', 'currency']\n", ")\n", "MONEY = rule(\n", "    rule(\n", "        type('INT'),\n", "        dictionary({\n", "            'тысяча',\n", "            'миллион'\n", "        })\n", "    ).interpretation(\n", "        Money.value\n", "    ),\n", "    eq('$').interpretation(\n", "        Money.currency\n", "    )\n", ").interpretation(\n", "    Money\n", ")\n", "\n", "parser = Parser(MONEY)\n", "match = parser.match('5 тысяч$')\n", "match.tree.as_dot" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`Money.value` contains two words:" ] },
 { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Money(\n", " value='5 тысяч',\n", " currency='$'\n", ")" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "When an attribute vertex has a mix of tokens and constructor vertices underneath, interpretation raises a `TypeError`:" ] },
 { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.predicates import true\n", "\n", "\n", "A = fact(\n", "    'A',\n", "    ['x']\n", ")\n", "B = fact(\n", "    'B',\n", "    ['y']\n", ")\n", "RULE = rule(\n", "    true(),\n", "    true().interpretation(\n", "        B.y\n", "    ).interpretation(\n", "        B\n", "    )\n", ").interpretation(\n", "    A.x\n", ").interpretation(\n", "    A\n", ")\n", "\n", "parser = Parser(RULE)\n", "match = parser.match('X Y')\n", "match.tree.as_dot" ] },
 { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# match.fact  # would raise TypeError" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "If an attribute vertex sits under another attribute vertex, the lower one simply disappears:" ] },
 { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.predicates import true\n", "\n", "\n", "A = fact(\n", "    'A',\n", "    ['x', 'y']\n", ")\n", "RULE = true().interpretation(\n", "    A.x\n", ").interpretation(\n", "    A.y\n", ").interpretation(A)\n", "\n", "parser = Parser(RULE)\n", "match = parser.match('X')\n", "match.tree.as_dot" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "\"X\" ends up in `A.y`, not in `A.x`:" ] },
 { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A(\n", " x=None,\n", " y='X'\n", ")" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "What if a constructor vertex has several identical attribute vertices underneath? The rightmost attribute overwrites all the others:" ] },
 { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = fact(\n", "    'A',\n", "    ['x']\n", ")\n", "RULE = true().interpretation(\n", "    A.x\n", ").repeatable().interpretation(\n", "    A\n", ")\n", "\n", "parser = Parser(RULE)\n", "match = parser.match('1 2 3')\n", "match.tree.normalized.as_dot" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`A.x` gets \"3\":" ] },
 { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "A(\n", " x='3'\n", ")" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, though, we need to keep the contents of all the repeated attribute vertices, not just the rightmost one. Mark the field as `repeatable`:" ] },
 { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy import not_\n", "\n", "\n", "Item = fact(\n", "    'Item',\n", "    [attribute('titles').repeatable()]\n", ")\n", "\n", "TITLE = rule(\n", "    '«',\n", "    not_(eq('»')).repeatable(),\n", "    '»'\n", ")\n", "ITEM = rule(\n", "    TITLE.interpretation(\n", "        Item.titles\n", "    ),\n", "    eq(',').optional()\n", ").repeatable().interpretation(\n", "    Item\n", ")\n", "\n", "parser = Parser(ITEM)\n", "text = '«Каштанка», «Дядя Ваня»'\n", "match = parser.match(text)\n", "match.tree.as_dot" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "«Дядя Ваня» does not overwrite «Каштанка»; both end up in `Item.titles`:" ] },
 { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Item(\n", " titles=['«Каштанка»',\n", " '«Дядя Ваня»']\n", ")" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "One last non-obvious case remains: a constructor vertex whose child is another constructor vertex. This situation arises with recursive grammars. In the example, the child of the `Item` vertex is another `Item` vertex:" ] },
 { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy import forward, or_\n", "\n", "Item = fact(\n", "    'Item',\n", "    ['title', 'date']\n", ")\n", "\n", "ITEM = forward().interpretation(\n", "    Item\n", ")\n", "ITEM.define(or_(\n", "    TITLE.interpretation(\n", "        Item.title\n", "    ),\n", "    rule(ITEM, TITLE),\n", "    rule(\n", "        ITEM,\n", "        DATE.interpretation(\n", "            Item.date\n", "        )\n", "    )\n", "))\n", "\n", "parser = Parser(ITEM)\n", "text = '«Каштанка» 18 июня'\n", "match = parser.match(text)\n", "match.tree.as_dot" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "During interpretation two objects appear: `Item(title='«Каштанка»', date=None)` and `Item(title=None, date=Date('18', 'июня'))`. At the end they are merged:" ] },
 { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Item(\n", " title='«Каштанка»',\n", " date=Date(\n", " year=2017,\n", " month='июня',\n", " day='18'\n", " )\n", ")" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Yargy implements four main normalization methods: `normalized`, `inflected`, `custom` and `const`. `normalized` returns the normal form of a word and corresponds to `normal_form` in Pymorphy2:" ] },
 { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date(\n", " year='2015',\n", " month='июня',\n", " day='8'\n", ")" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATE = rule(\n", "    DAY.interpretation(\n", "        Date.day\n", "    ),\n", "    MONTH_NAME.interpretation(\n", "        Date.month\n", "    ),\n", "    YEAR.interpretation(\n", "        Date.year\n", "    )\n", ").interpretation(\n", "    Date\n", ")\n", "\n", "parser = Parser(DATE)\n", "match = parser.match('8 июня 2015')\n", "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "With `normalized` the word \"июня\" becomes \"июнь\":" ] },
 { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date(\n", " year='2015',\n", " month='июнь',\n", " day='8'\n", ")" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DATE = rule(\n", "    DAY.interpretation(\n", "        Date.day\n", "    ),\n", "    MONTH_NAME.interpretation(\n", "        Date.month.normalized()\n", "    ),\n", "    YEAR.interpretation(\n", "        Date.year\n", "    )\n", ").interpretation(\n", "    Date\n", ")\n", "\n", "parser = Parser(DATE)\n", "match = parser.match('8 июня 2015')\n", "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "If several tokens reach `normalized`, each one is put into its normal form independently, without agreement:" ] },
 { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Geo(\n", " name='красный площадь'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from yargy.interpretation import fact\n", "from yargy.predicates import normalized\n", "from IPython.display import display\n", "\n", "\n", "Geo = fact(\n", "    'Geo',\n", "    ['name']\n", ")\n", "\n", "RULE = rule(\n", "    normalized('Красная'),\n", "    normalized('площадь')\n", ").interpretation(\n", "    Geo.name.normalized()\n", ").interpretation(\n", "    Geo\n", ")\n", "\n", "parser = Parser(RULE)\n", "for match in parser.findall('на Красной площади'):\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`normalized` behaves specially after a gazetteer: the result of normalization is the gazetteer key:" ] },
 { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Geo(\n", " name='красная площадь'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Geo(\n", " name='первомайская улица'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from yargy.pipelines import morph_pipeline\n", "\n", "RULE = morph_pipeline([\n", "    'красная площадь',\n", "    'первомайская улица'\n", "]).interpretation(\n", "    Geo.name.normalized()\n", ").interpretation(\n", "    Geo\n", ")\n", "\n", "parser = Parser(RULE)\n", "for match in parser.findall('c Красной площади на Первомайскую улицу'):\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`inflected` inflects a word and corresponds to the `inflect` method in Pymorphy2:" ] },
 { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name(\n", " first='саша'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name(\n", " first='маша'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name(\n", " first='вадим'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from yargy.interpretation import fact\n", "from yargy.predicates import gram\n", "\n", "Name = fact(\n", "    'Name',\n", "    ['first']\n", ")\n", "\n", "NAME = gram('Name').interpretation(\n", "    Name.first.inflected()\n", ").interpretation(\n", "    Name\n", ")\n", "\n", "parser = Parser(NAME)\n", "for match in parser.findall('Саше, Маше, Вадиму'):\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`inflected` takes a set of grammemes:" ] },
 { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name(\n", " first='саш'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name(\n", " first='маш'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Name(\n", " first='вадимов'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "NAME = gram('Name').interpretation(\n", "    Name.first.inflected({'accs', 'plur'})  # accusative case, plural\n", ").interpretation(\n", "    Name\n", ")\n", "\n", "parser = Parser(NAME)\n", "for match in parser.findall('Саша, Маша, Вадим'):\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`custom` applies an arbitrary function to a word:" ] },
 { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Float(\n", " value=3.1415\n", ")" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.interpretation import fact\n", "from yargy.predicates import type\n", "\n", "Float = fact(\n", "    'Float',\n", "    ['value']\n", ")\n", "\n", "\n", "INT = type('INT')\n", "FLOAT = rule(\n", "    INT,\n", "    '.',\n", "    INT\n", ").interpretation(\n", "    Float.value.custom(float)\n", ").interpretation(\n", "    Float\n", ")\n", "\n", "parser = Parser(FLOAT)\n", "match = parser.match('3.1415')\n", "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`custom` can be combined with `normalized`. The word is first put into its normal form, then the function is applied:" ] },
 { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date(\n", " year=2015,\n", " month=6,\n", " day=8\n", ")" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MONTHS = {\n", "    'январь': 1,\n", "    'февраль': 2,\n", "    'март': 3,\n", "    'апрель': 4,\n", "    'май': 5,\n", "    'июнь': 6,\n", "    'июль': 7,\n", "    'август': 8,\n", "    'сентябрь': 9,\n", "    'октябрь': 10,\n", "    'ноябрь': 11,\n", "    'декабрь': 12\n", "}\n", "\n", "DATE = rule(\n", "    DAY.interpretation(\n", "        Date.day.custom(int)\n", "    ),\n", "    MONTH_NAME.interpretation(\n", "        Date.month.normalized().custom(MONTHS.__getitem__)\n", "    ),\n", "    YEAR.interpretation(\n", "        Date.year.custom(int)\n", "    )\n", ").interpretation(\n", "    Date\n", ")\n", "\n", "parser = Parser(DATE)\n", "match = parser.match('8 июня 2015')\n", "match.fact" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "`const` simply replaces a word or phrase with a fixed value:" ] },
 { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Era(\n", " value='AD'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Era(\n", " value='BC'\n", ")" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Era = fact(\n", "    'Era',\n", "    ['value']\n", ")\n", "\n", "BC = morph_pipeline([\n", "    'до нашей эры',\n", "    'до н.э.'\n", "]).interpretation(\n", "    Era.value.const('BC')\n", ")\n", "AD = morph_pipeline([\n", "    'наша эра',\n", "    'н.э.'\n", "]).interpretation(\n", "    Era.value.const('AD')\n", ")\n", "ERA = or_(\n", "    BC,\n", "    AD\n", ").interpretation(\n", "    Era\n", ")\n", "\n", "parser = Parser(ERA)\n", "for match in parser.findall('наша эра, до н.э.'):\n", "    display(match.fact)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Agreement" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Yargy implements four types of agreement: `gender_relation` (gender), `number_relation` (number), `case_relation` (case) and `gnc_relation` (gender, number and case). The `match` method attaches an agreement:" ] },
 { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name(\n", " first='саша',\n", " last='иванова'\n", ")" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from yargy.relations import gnc_relation\n", "\n", "Name = fact(\n", "    'Name',\n", "    ['first', 'last']\n", ")\n", "\n", "gnc = gnc_relation()\n", "\n", "NAME = rule(\n", "    gram('Name').interpretation(\n", "        Name.first.inflected()\n", "    ).match(gnc),\n", "    gram('Surn').interpretation(\n", "        Name.last.inflected()\n", "    ).match(gnc)\n", ").interpretation(\n", "    Name\n", ")\n", "\n", "parser = Parser(NAME)\n", "match = parser.match('Сашу Иванову')\n", "display(match.fact)\n", "display(match.tree.as_dot)" ] },
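 { "cell_type": "markdown", "metadata": {}, "source": [ "The other relations work the same way. A sketch with `gender_relation`, assuming it is importable from `yargy.relations` just like `gnc_relation` and that `match` applies to predicates directly:" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yargy import Parser, rule\n", "from yargy.relations import gender_relation\n", "from yargy.predicates import gram\n", "\n", "gender = gender_relation()\n", "\n", "# an adjective and a noun agreeing in gender\n", "PHRASE = rule(\n", "    gram('ADJF').match(gender),\n", "    gram('NOUN').match(gender)\n", ")\n", "\n", "parser = Parser(PHRASE)\n", "assert parser.match('красивая мама')\n", "assert not parser.match('красивый мама')" ] },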
 { "cell_type": "markdown", "metadata": {}, "source": [ "`main` marks the head word of a phrase. By default the head word is the leftmost one:" ] },
 { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.relations import main\n", "\n", "POSITION = rule(\n", "    normalized('главный'),\n", "    main(normalized('бухгалтер'))\n", ")\n", "\n", "POSITION.as_dot" ] },
 { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Graph(nodes=[...], edges=[...])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.relations import case_relation\n", "\n", "case = case_relation()\n", "\n", "PERSON = rule(\n", "    POSITION.match(case),\n", "    NAME.match(case)\n", ")\n", "\n", "\n", "parser = Parser(PERSON)\n", "assert not parser.match('главного бухгалтер марину игореву')\n", "\n", "match = parser.match('главного бухгалтера марину игореву')\n", "match.tree.as_dot" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }