{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Справочник" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Токенизатор" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Токенизатор в Yargy реализован на регулярных выражениях. Для каждого типа токена есть правило с регуляркой:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[TokenRule(\n", " type='RU',\n", " pattern='[а-яё]+'\n", " ),\n", " TokenRule(\n", " type='LATIN',\n", " pattern='[a-z]+'\n", " ),\n", " TokenRule(\n", " type='INT',\n", " pattern='\\\\d+'\n", " ),\n", " TokenRule(\n", " type='PUNCT',\n", " pattern='[-\\\\\\\\/!#$%&()\\\\[\\\\]\\\\*\\\\+,\\\\.:;<=>?@^_`{|}~№…\"\\\\\\'«»„“ʼʻ”]'\n", " ),\n", " TokenRule(\n", " type='EOL',\n", " pattern='[\\\\n\\\\r]+'\n", " ),\n", " TokenRule(\n", " type='OTHER',\n", " pattern='\\\\S'\n", " )]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import RULES\n", "\n", "\n", "RULES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Токенизатор инициализируется списком правил. 
By default it is `RULES`:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='a',\n", " span=[0, 1),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='@',\n", " span=[1, 2),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='mail',\n", " span=[2, 6),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='.',\n", " span=[6, 7),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='ru',\n", " span=[7, 9),\n", " type='LATIN'\n", " )]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import Tokenizer\n", "\n", "\n", "text = 'a@mail.ru'\n", "tokenizer = Tokenizer()\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The user can remove some of the rules from the list or add new ones. Let's get rid of the line-break tokens:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='\\n',\n", " span=[0, 1),\n", " type='EOL'\n", " ),\n", " Token(\n", " value='abc',\n", " span=[1, 4),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='\\n',\n", " span=[4, 5),\n", " type='EOL'\n", " ),\n", " Token(\n", " value='123',\n", " span=[5, 8),\n", " type='INT'\n", " ),\n", " Token(\n", " value='\\n',\n", " span=[8, 9),\n", " type='EOL'\n", " )]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer = Tokenizer()\n", "\n", "text = '''\n", "abc\n", "123\n", "'''\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do that, remove the `EOL` rule:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='abc',\n", " span=[1, 4),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value='123',\n", " span=[5, 8),\n", " type='INT'\n", " )]" ] }, "execution_count": 4, "metadata": {},
"output_type": "execute_result" } ], "source": [ "tokenizer = Tokenizer().remove_types('EOL')\n", "\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В Yargy есть примитивные правила для токенизации емейлов и телефонов. По-умолчанию они отключены:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='email',\n", " span=[0, 5),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value=':',\n", " span=[5, 6),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value='ab@mail.ru',\n", " span=[7, 17),\n", " type='EMAIL'\n", " ),\n", " Token(\n", " value='call',\n", " span=[18, 22),\n", " type='LATIN'\n", " ),\n", " Token(\n", " value=':',\n", " span=[22, 23),\n", " type='PUNCT'\n", " ),\n", " Token(\n", " value=' 8 915 132 54 76',\n", " span=[23, 39),\n", " type='PHONE'\n", " )]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import EMAIL_RULE, PHONE_RULE\n", "\n", "\n", "text = 'email: ab@mail.ru call: 8 915 132 54 76'\n", "tokenizer = Tokenizer().add_rules(EMAIL_RULE, PHONE_RULE)\n", "list(tokenizer(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Добавим собственное для извлечения доменов:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='на',\n", " span=[0, 2),\n", " type='RU'\n", " ),\n", " Token(\n", " value='сайте',\n", " span=[3, 8),\n", " type='RU'\n", " ),\n", " Token(\n", " value='www.VKontakte.ru',\n", " span=[9, 25),\n", " type='DOMAIN'\n", " )]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import TokenRule\n", "\n", "\n", "DOMAIN_RULE = TokenRule('DOMAIN', '[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+')\n", "\n", " \n", "text = 'на сайте www.VKontakte.ru'\n", "tokenizer = Tokenizer().add_rules(DOMAIN_RULE)\n", "list(tokenizer(text))" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "По умолчанию, Yargy использует не `Tokenizer`, а `MorphTokenizer`. Для каждого токена с типом `'RU'` он запускает Pymorphy2, добавляет поле `forms` с морфологией:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Token(\n", " value='X',\n", " span=[0, 1),\n", " type='LATIN'\n", " ),\n", " MorphToken(\n", " value='век',\n", " span=[2, 5),\n", " type='RU',\n", " forms=[Form('век', Grams(NOUN,inan,masc,nomn,sing)),\n", " Form('век', Grams(NOUN,accs,inan,masc,sing)),\n", " Form('век', Grams(ADVB)),\n", " Form('веко', Grams(NOUN,gent,inan,neut,plur))]\n", " ),\n", " MorphToken(\n", " value='стал',\n", " span=[6, 10),\n", " type='RU',\n", " forms=[Form('стать', Grams(VERB,indc,intr,masc,past,perf,sing))]\n", " )]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from yargy.tokenizer import MorphTokenizer\n", "\n", "tokenizer = MorphTokenizer()\n", "list(tokenizer('X век стал'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Газеттир" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Словарь профессий, географических объектов можно записывать стандартные средствами через `rule`, `or_`, `normalized`, `caseless`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from yargy import rule, or_\n", "from yargy.predicates import normalized, caseless\n", "\n", "\n", "POSITION = or_(\n", " rule(normalized('генеральный'), normalized('директор')),\n", " rule(normalized('бухгалтер'))\n", ")\n", "\n", "GEO = or_(\n", " rule(normalized('Ростов'), '-', caseless('на'), '-', caseless('Дону')),\n", " rule(normalized('Москва'))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Это неудобно, легко ошибиться. Для составления словарей в Yargy используется `pipeline`. Реализовано два типа газеттиров: `morph_pipeline` и `caseless_pipeline`. 
`morph_pipeline` normalizes words before matching:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['электронным', 'дневником']\n", "['электронные', 'дневники']\n", "['электронное', 'дневнику']\n" ] } ], "source": [ "from yargy import Parser\n", "from yargy.pipelines import morph_pipeline\n", "\n", "\n", "TYPE = morph_pipeline(['электронный дневник'])\n", "\n", "parser = Parser(TYPE)\n", "text = 'электронным дневником, электронные дневники, электронное дневнику'\n", "for match in parser.findall(text):\n", " print([_.value for _ in match.tokens])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`caseless_pipeline` looks words up without normalization, ignoring case. For example, let's find the Arabic names \"Абд Аль-Азиз Бин Мухаммад\" and \"Абд ар-Рахман Наср ас-Са ди\" in a text: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Абд', 'Аль', '-', 'Азиз', 'Бин', 'Мухаммад']\n", "['АБД', 'АР', '-', 'РАХМАН', 'НАСР', 'АС', '-', 'СА', 'ДИ']\n" ] } ], "source": [ "from yargy.pipelines import caseless_pipeline\n", "\n", "\n", "NAME = caseless_pipeline([\n", " 'Абд Аль-Азиз Бин Мухаммад',\n", " 'Абд ар-Рахман Наср ас-Са ди'\n", "])\n", "\n", "parser = Parser(NAME)\n", "text = 'Абд Аль-Азиз Бин Мухаммад, АБД АР-РАХМАН НАСР АС-СА ДИ'\n", "for match in parser.findall(text):\n", " print([_.value for _ in match.tokens])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicates" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
a == b\n",
"\n",
" >>> predicate = eq('1')\n",
" >>> token, = tokenize('1')\n",
" >>> predicate(token)\n",
" True\n",
" a.lower() == b.lower()\n",
"\n",
" >>> predicate = caseless('Рано')\n",
" >>> token, = tokenize('РАНО')\n",
" >>> predicate(token)\n",
" True\n",
" a in b\n",
"\n",
" >>> predicate = in_({'S', 'M', 'L'})\n",
" >>> a, b = tokenize('S 1')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" False\n",
" a.lower() in b\n",
"\n",
" >>> predicate = in_caseless({'S', 'M', 'L'})\n",
" >>> a, b = tokenize('S m')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" True\n",
" a >= b\n",
"\n",
" >>> predicate = gte(4)\n",
" >>> a, b, c = tokenize('3 5 C')\n",
" >>> predicate(a)\n",
" False\n",
" >>> predicate(b)\n",
" True\n",
" >>> predicate(c)\n",
" False\n",
" a <= b\n",
"\n",
" >>> predicate = lte(4)\n",
" >>> a, b, c = tokenize('3 5 C')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" False\n",
" >>> predicate(c)\n",
" False\n",
" len(a) == b\n",
"\n",
" >>> predicate = length_eq(3)\n",
" >>> a, b = tokenize('XXX 123')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" True\n",
" Нормальная форма слова == value\n",
"\n",
" >>> a = activate(normalized('сталь'))\n",
" >>> b = activate(normalized('стать'))\n",
" >>> token, = tokenize('стали')\n",
" >>> a(token)\n",
" True\n",
" >>> b(token)\n",
" True\n",
" Нормальная форма слова in value\n",
"\n",
" >>> predicate = activate(dictionary({'учитель', 'врач'}))\n",
" >>> a, b = tokenize('учителя врачи')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" True\n",
" value есть среди граммем слова\n",
"\n",
" >>> a = activate(gram('NOUN'))\n",
" >>> b = activate(gram('VERB'))\n",
" >>> token, = tokenize('стали')\n",
" >>> a(token)\n",
" True\n",
" >>> b(token)\n",
" True\n",
" Тип токена равен value\n",
"\n",
" >>> predicate = activate(type('INT'))\n",
" >>> a, b = tokenize('3 раза')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" False\n",
"Тег токена равен value\n", "
function в качестве предиката\n",
"\n",
" >>> from math import log\n",
" >>> f = lambda x: int(log(int(x), 10)) == 2\n",
" >>> predicate = activate(custom(f, types=INT))\n",
" >>> a, b = tokenize('12 123')\n",
" >>> predicate(a)\n",
" False\n",
" >>> predicate(b)\n",
" True\n",
" Всегда возвращает True\n",
"\n",
" >>> predicate = true()\n",
" >>> predicate(False)\n",
" True\n",
" str.islower\n",
"\n",
" >>> predicate = is_lower()\n",
" >>> a, b = tokenize('xxx Xxx')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" False\n",
" str.isupper\n",
"\n",
" >>> predicate = is_upper()\n",
" >>> a, b = tokenize('XXX xxx')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" False\n",
" str.istitle\n",
"\n",
" >>> predicate = is_title()\n",
" >>> a, b = tokenize('XXX Xxx')\n",
" >>> predicate(a)\n",
" False\n",
" >>> predicate(b)\n",
" True\n",
" Слово написано с большой буквы\n",
"\n",
" >>> predicate = is_capitalized()\n",
" >>> a, b, c = tokenize('Xxx XXX xxX')\n",
" >>> predicate(a)\n",
" True\n",
" >>> predicate(b)\n",
" True\n",
" >>> predicate(c)\n",
" False\n",
" Слово в единственном числе\n",
"\n",
" >>> predicate = is_single()\n",
" >>> token, = tokenize('слово')\n",
" >>> predicate(token)\n",
" True"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HTML('\\n'.join(html()))"
]
},
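{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predicates above combine with `and_`, `or_` and `not_` inside rules. A minimal sketch (the example text and this particular combination are ours, not from the original): match nouns written with a capital letter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from yargy import Parser, rule, and_\n",
"from yargy.predicates import gram, is_capitalized\n",
"\n",
"\n",
"# both predicates must hold for the same token\n",
"CAPITALIZED_NOUN = and_(\n",
"    gram('NOUN'),\n",
"    is_capitalized()\n",
")\n",
"\n",
"parser = Parser(rule(CAPITALIZED_NOUN))\n",
"for match in parser.findall('Москва против простого слова'):\n",
"    print([_.value for _ in match.tokens])"
]
},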
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Интерпретация"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Объект-результат интерпретации описывает конструктор `fact`. `attribute` задаёт значение поля по-умолчанию. Например, в `Date` по-умолчанию год будет равен 2017:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date(\n",
" year='2016',\n",
" month='июля',\n",
" day='18'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Date(\n",
" year=2017,\n",
" month='марта',\n",
" day='15'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import display\n",
"\n",
"from yargy import Parser, rule, and_, or_\n",
"from yargy.interpretation import fact, attribute\n",
"from yargy.predicates import dictionary, gte, lte\n",
"\n",
"\n",
"Date = fact(\n",
" 'Date',\n",
" [attribute('year', 2017), 'month', 'day']\n",
")\n",
"\n",
"\n",
"MONTHS = {\n",
" 'январь',\n",
" 'февраль',\n",
" 'март',\n",
" 'апрель',\n",
" 'май',\n",
" 'июнь',\n",
" 'июль',\n",
" 'август',\n",
" 'сентябрь',\n",
" 'октябрь',\n",
" 'ноябрь',\n",
" 'декабрь'\n",
"}\n",
"\n",
"\n",
"MONTH_NAME = dictionary(MONTHS)\n",
"DAY = and_(\n",
" gte(1),\n",
" lte(31)\n",
")\n",
"YEAR = and_(\n",
" gte(1900),\n",
" lte(2100)\n",
")\n",
"DATE = rule(\n",
" DAY.interpretation(\n",
" Date.day\n",
" ),\n",
" MONTH_NAME.interpretation(\n",
" Date.month\n",
" ),\n",
" YEAR.interpretation(\n",
" Date.year\n",
" ).optional()\n",
").interpretation(\n",
" Date\n",
")\n",
"\n",
"\n",
"text = '''18 июля 2016\n",
"15 марта\n",
"'''\n",
"parser = Parser(DATE)\n",
"for line in text.splitlines():\n",
" match = parser.match(line)\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Для дат деревья разбора выглядят просто: вершина-конструктор и несколько детей-атрибутов:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"parser = Parser(DATE)\n",
"for line in text.splitlines():\n",
" match = parser.match(line)\n",
" display(match.tree.as_dot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Как будет себя вести алгоритм интерпретации, когда ребёнок конструктора не атрибут, а другой конструктор? Или когда ребёнок атрибута другой атрибут? Или когда под конструктором или атрибутом не одна, а несколько вершин с токенами? Пойдём от простого к сложному. Когда под вершиной-атрибутом несколько токенов, они объединяются:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.predicates import eq, type, dictionary\n",
"\n",
"\n",
"Money = fact(\n",
" 'Money',\n",
" ['value', 'currency']\n",
")\n",
"MONEY = rule(\n",
" rule(\n",
" type('INT'),\n",
" dictionary({\n",
" 'тысяча',\n",
" 'миллион'\n",
" })\n",
" ).interpretation(\n",
" Money.value\n",
" ),\n",
" eq('$').interpretation(\n",
" Money.currency\n",
" )\n",
").interpretation(\n",
" Money\n",
")\n",
"\n",
"parser = Parser(MONEY)\n",
"match = parser.match('5 тысяч$')\n",
"match.tree.as_dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В `Money.value` два слова:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Money(\n",
" value='5 тысяч',\n",
" currency='$'\n",
")"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Когда под вершиной-атрибутом смесь из токенов и вершин-конструктов, интерпретация кидает `TypeError`:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.predicates import true\n",
"\n",
"\n",
"A = fact(\n",
" 'A',\n",
" ['x']\n",
")\n",
"B = fact(\n",
" 'B',\n",
" ['y']\n",
")\n",
"RULE = rule(\n",
" true(),\n",
" true().interpretation(\n",
" B.y\n",
" ).interpretation(\n",
" B\n",
" )\n",
").interpretation(\n",
" A.x\n",
").interpretation(\n",
" A\n",
")\n",
"\n",
"parser = Parser(RULE)\n",
"match = parser.match('X Y')\n",
"match.tree.as_dot"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# match.fact Будет TypeError"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Если под вершиной-атрибутом другая вершина-атрибут, нижняя просто исчезает:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.predicates import true\n",
"\n",
"\n",
"A = fact(\n",
" 'A',\n",
" ['x', 'y']\n",
")\n",
"RULE = true().interpretation(\n",
" A.x\n",
").interpretation(\n",
" A.y\n",
").interpretation(A)\n",
"\n",
"parser = Parser(RULE)\n",
"match = parser.match('X')\n",
"match.tree.as_dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"X\" попадёт в `A.y`, не в `A.x`:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A(\n",
" x=None,\n",
" y='X'\n",
")"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Что если под вершиной-конструктом несколько одинаковых вершин-атрибутов? Самый правый атрибут перезаписывает все остальные:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = fact(\n",
" 'A',\n",
" ['x']\n",
")\n",
"RULE = true().interpretation(\n",
" A.x\n",
").repeatable().interpretation(\n",
" A\n",
")\n",
"\n",
"parser = Parser(RULE)\n",
"match = parser.match('1 2 3')\n",
"match.tree.normalized.as_dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В `A.x` попадёт \"3\":"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A(\n",
" x='3'\n",
")"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Но бывает нужно сохранить содержание всех повторяющихся вершин-атрибутов, не только самой правой. Помечаем поле как `repeatable`:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy import not_\n",
"\n",
"\n",
"Item = fact(\n",
" 'Item',\n",
" [attribute('titles').repeatable()]\n",
")\n",
"\n",
"TITLE = rule(\n",
" '«',\n",
" not_(eq('»')).repeatable(),\n",
" '»'\n",
")\n",
"ITEM = rule(\n",
" TITLE.interpretation(\n",
" Item.titles\n",
" ),\n",
" eq(',').optional()\n",
").repeatable().interpretation(\n",
" Item\n",
")\n",
"\n",
"parser = Parser(ITEM)\n",
"text = '«Каштанка», «Дядя Ваня»'\n",
"match = parser.match(text)\n",
"match.tree.as_dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"«Дядя Ваня» не перезапишет «Каштанка», они оба окажутся в `Item.titles`:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Item(\n",
" titles=['«Каштанка»',\n",
" '«Дядя Ваня»']\n",
")"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Остался последний неочевидный случай, когда ребёнок вершины-конструктора, другая вершина-конструктор. Такая ситуация возникает при использовании рекурсивных грамматик. В примере ребёнок вершины `Item` другая вершина `Item`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy import forward, or_\n",
"\n",
"Item = fact(\n",
" 'Item',\n",
" ['title', 'date']\n",
")\n",
"\n",
"ITEM = forward().interpretation(\n",
" Item\n",
")\n",
"ITEM.define(or_(\n",
" TITLE.interpretation(\n",
" Item.title\n",
" ),\n",
" rule(ITEM, TITLE),\n",
" rule(\n",
" ITEM,\n",
" DATE.interpretation(\n",
" Item.date\n",
" )\n",
" )\n",
"))\n",
"\n",
"parser = Parser(ITEM)\n",
"text = '«Каштанка» 18 июня'\n",
"match = parser.match(text)\n",
"match.tree.as_dot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В ходе интерпретации появится два объекта: `Item(title='«Каштанка»', date=None)` и `Item(title=None, date=Date('18', 'июня'))`. В конце произойдёт слияние:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Item(\n",
" title='«Каштанка»',\n",
" date=Date(\n",
" year=2017,\n",
" month='июня',\n",
" day='18'\n",
" )\n",
")"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Нормализация"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В Yargy реализованы четыре основных метода для нормализации: `normalized`, `inflected`, `custom` и `const`. `normalized` возвращает нормальную форму слова, соответствует `normal_form` в Pymorphy2:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date(\n",
" year='2015',\n",
" month='июня',\n",
" day='8'\n",
")"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DATE = rule(\n",
" DAY.interpretation(\n",
" Date.day\n",
" ),\n",
" MONTH_NAME.interpretation(\n",
" Date.month\n",
" ),\n",
" YEAR.interpretation(\n",
" Date.year\n",
" )\n",
").interpretation(\n",
" Date\n",
")\n",
"\n",
"parser = Parser(DATE)\n",
"match = parser.match('8 июня 2015')\n",
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"С `normalized` слово \"июня\" меняется на \"июнь\":"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date(\n",
" year='2015',\n",
" month='июнь',\n",
" day='8'\n",
")"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DATE = rule(\n",
" DAY.interpretation(\n",
" Date.day\n",
" ),\n",
" MONTH_NAME.interpretation(\n",
" Date.month.normalized()\n",
" ),\n",
" YEAR.interpretation(\n",
" Date.year\n",
" )\n",
").interpretation(\n",
" Date\n",
")\n",
"\n",
"parser = Parser(DATE)\n",
"match = parser.match('8 июня 2015')\n",
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Если в `normalized` попадает несколько токенов, каждый приводится к нормальной форме без согласования:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Geo(\n",
" name='красный площадь'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from yargy.interpretation import fact\n",
"from yargy.predicates import normalized\n",
"from IPython.display import display\n",
"\n",
"\n",
"Geo = fact(\n",
" 'Geo',\n",
" ['name']\n",
")\n",
"\n",
"RULE = rule(\n",
" normalized('Красная'),\n",
" normalized('площадь')\n",
").interpretation(\n",
" Geo.name.normalized()\n",
").interpretation(\n",
" Geo\n",
")\n",
"\n",
"parser = Parser(RULE)\n",
"for match in parser.findall('на Красной площади'):\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Особым образом ведёт себя `normalized`, когда идёт после газеттира. Результат нормализации — ключ газеттира:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Geo(\n",
" name='красная площадь'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Geo(\n",
" name='первомайская улица'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from yargy.pipelines import morph_pipeline\n",
"\n",
"RULE = morph_pipeline([\n",
" 'красная площадь',\n",
" 'первомайская улица'\n",
"]).interpretation(\n",
" Geo.name.normalized()\n",
").interpretation(\n",
" Geo\n",
")\n",
"\n",
"parser = Parser(RULE)\n",
"for match in parser.findall('c Красной площади на Первомайскую улицу'):\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`inflected` склоняет слово, соответствует методу `inflect` в Pymorphy2:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name(\n",
" first='саша'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Name(\n",
" first='маша'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Name(\n",
" first='вадим'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from yargy.interpretation import fact\n",
"from yargy.predicates import gram\n",
"\n",
"Name = fact(\n",
" 'Name',\n",
" ['first']\n",
")\n",
"\n",
"NAME = gram('Name').interpretation(\n",
" Name.first.inflected()\n",
").interpretation(\n",
" Name\n",
")\n",
"\n",
"parser = Parser(NAME)\n",
"for match in parser.findall('Саше, Маше, Вадиму'):\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`inflected` принимает набор граммем:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name(\n",
" first='саш'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Name(\n",
" first='маш'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Name(\n",
" first='вадимов'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"NAME = gram('Name').interpretation(\n",
" Name.first.inflected({'accs', 'plur'}) # винительный падеж, множественное число\n",
").interpretation(\n",
" Name\n",
")\n",
"\n",
"parser = Parser(NAME)\n",
"for match in parser.findall('Саша, Маша, Вадим'):\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`custom` применяет к слову произвольную функцию:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Float(\n",
" value=3.1415\n",
")"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.interpretation import fact\n",
"from yargy.predicates import type\n",
"\n",
"Float = fact(\n",
" 'Float',\n",
" ['value']\n",
")\n",
"\n",
"\n",
"INT = type('INT')\n",
"FLOAT = rule(\n",
" INT,\n",
" '.',\n",
" INT\n",
").interpretation(\n",
" Float.value.custom(float)\n",
").interpretation(\n",
" Float\n",
")\n",
"\n",
"parser = Parser(FLOAT)\n",
"match = parser.match('3.1415')\n",
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`custom` может применяться вместе с `normalized`. Тогда слово начала ставится в нормальную форму, потом к нему применяется функция:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date(\n",
" year=2015,\n",
" month=6,\n",
" day=8\n",
")"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MONTHS = {\n",
" 'январь': 1,\n",
" 'февраль': 2,\n",
" 'март': 3,\n",
" 'апрель': 4,\n",
" 'май': 5,\n",
" 'июнь': 6,\n",
" 'июль': 7,\n",
" 'август': 8,\n",
" 'сентябрь': 9,\n",
" 'октябрь': 10,\n",
" 'ноябрь': 11,\n",
" 'декабрь': 12\n",
"}\n",
"\n",
"DATE = rule(\n",
" DAY.interpretation(\n",
" Date.day.custom(int)\n",
" ),\n",
" MONTH_NAME.interpretation(\n",
" Date.month.normalized().custom(MONTHS.__getitem__)\n",
" ),\n",
" YEAR.interpretation(\n",
" Date.year.custom(int)\n",
" )\n",
").interpretation(\n",
" Date\n",
")\n",
"\n",
"parser = Parser(DATE)\n",
"match = parser.match('8 июня 2015')\n",
"match.fact"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`const` просто заменяет слово или словосочетания фиксированным значением:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Era(\n",
" value='AD'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Era(\n",
" value='BC'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"Era = fact(\n",
" 'Era',\n",
" ['value']\n",
")\n",
"\n",
"BC = morph_pipeline([\n",
" 'до нашей эры',\n",
" 'до н.э.'\n",
"]).interpretation(\n",
" Era.value.const('BC')\n",
")\n",
"AD = morph_pipeline([\n",
" 'наша эра',\n",
" 'н.э.'\n",
"]).interpretation(\n",
" Era.value.const('AD')\n",
")\n",
"ERA = or_(\n",
" BC,\n",
" AD\n",
").interpretation(\n",
" Era\n",
")\n",
"\n",
"parser = Parser(ERA)\n",
"for match in parser.findall('наша эра, до н.э.'):\n",
" display(match.fact)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Согласование"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"В Yargy реализовано четыре типа согласований: `gender_relation` — согласование по роду, `number_relation` — по числу, `case_relation` — по падежу, `gnc_relation` — по роду, числу и падежу. Метод `match` указывает согласование:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name(\n",
" first='саша',\n",
" last='иванова'\n",
")"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from yargy.relations import gnc_relation\n",
"\n",
"Name = fact(\n",
" 'Name',\n",
" ['first', 'last']\n",
")\n",
"\n",
"gnc = gnc_relation()\n",
"\n",
"NAME = rule(\n",
" gram('Name').interpretation(\n",
" Name.first.inflected()\n",
" ).match(gnc),\n",
" gram('Surn').interpretation(\n",
" Name.last.inflected()\n",
" ).match(gnc)\n",
").interpretation(\n",
" Name\n",
")\n",
"\n",
"parser = Parser(NAME)\n",
"match = parser.match('Сашу Иванову')\n",
"display(match.fact)\n",
"display(match.tree.as_dot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`main` указывает на главное слово во фразе. По-умолчанию главное слово — самое левое:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.relations import main\n",
"\n",
"POSITION = rule(\n",
" normalized('главный'),\n",
" main(normalized('бухгалтер'))\n",
")\n",
"\n",
"POSITION.as_dot"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"Graph(nodes=[...], edges=[...])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from yargy.relations import case_relation\n",
"\n",
"case = case_relation()\n",
"\n",
"PERSON = rule(\n",
" POSITION.match(case),\n",
" NAME.match(case)\n",
")\n",
"\n",
"\n",
"parser = Parser(PERSON)\n",
"assert not parser.match('главного бухгалтер марину игореву')\n",
"\n",
"match = parser.match('главного бухгалтера марину игореву')\n",
"match.tree.as_dot"
]
},
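{
"cell_type": "markdown",
"metadata": {},
"source": [
"The remaining relation types work the same way. A minimal sketch (the adjective plus noun example is ours, not from the original): `gender_relation` rejects a pair whose genders disagree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from yargy import Parser, rule\n",
"from yargy.predicates import gram\n",
"from yargy.relations import gender_relation\n",
"\n",
"gender = gender_relation()\n",
"\n",
"# adjective and noun must agree in gender\n",
"ADJF_NOUN = rule(\n",
"    gram('ADJF').match(gender),\n",
"    gram('NOUN').match(gender)\n",
")\n",
"\n",
"parser = Parser(ADJF_NOUN)\n",
"assert not parser.match('красивый мама')\n",
"\n",
"match = parser.match('красивая мама')\n",
"match.tree.as_dot"
]
},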
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}