{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentence Structure \n", "\n", "*These notes will be heavily based on this [book](https://www.nltk.org/book/).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Goals\n", "\n", "The goals of this notebook are:\n", "- Understand how to use a formal grammar to describe the structure of sentences\n", "- Use syntax trees to represent the structure of sentences\n", "- Understand how parser analyze sentences and build a syntax tree\n", "\n", "Motivated by the task of **natural language understanding**, we will show how to develop formal models for patterns in sequence of words using grammars and parsers. We want to understand how recognizing linguistic structures can help understanding the meaning of a text. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Framework\n", "\n", "Ths framework will be of generative grammar. From Wikipedia:\n", "\n", "> Generative grammar is a linguistic theory that regards grammar as a system of rules that generates exactly those combinations of words that form grammatical sentences in a given language.\n", "\n", "In other words:\n", "> A language is the collection of all grammatical sentences and a grammar is a formal notation that can be used to generate the members of this set.\n", "\n", "As will be explained, grammars use recursive **productions** of the form $S \\to S \\,\\text{and}\\, S$.\n", "\n", "\n", "### Sentence structure and meaning\n", "\n", "Following [book](https://www.nltk.org/book/) I will use the following quote my G. Marx to illustrate ambiguities found in language:\n", "\n", "*\"While hunting in Africa, **I shot an elephant in my pajamas**. How he got into my pajamas, I don't know.\"* (G. Marx)\n", "\n", "To examine the ambiguous phrase sentence in bold we first define a grammar. About the syntax, `nltk.CFG.fromstring` returns the context-free grammar (CFG) corresponding to the input string. In the next section CFG will be discussed in detail but a quick definition is:\n", "\n", "> In formal language theory, a context-free grammar (CFG) is a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "groucho_grammar = nltk.CFG.fromstring(\"\"\"\n", "S -> NP VP\n", "PP -> P NP\n", "NP -> Det N | Det N PP | 'I'\n", "VP -> V NP | VP PP\n", "Det -> 'an' | 'my'\n", "N -> 'elephant' | 'pajamas'\n", "V -> 'shot'\n", "P -> 'in'\n", "\"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The meaning of this symbols can be found [here](https://github.com/marcotav/alphabet-human-thought/blob/master/images/syntactic-categories.png). Using `nltk.ChartParser` we generate:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'I shot an elephant in my pajamas'.split(' ')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S\n", " (NP I)\n", " (VP\n", " (VP (V shot) (NP (Det an) (N elephant)))\n", " (PP (P in) (NP (Det my) (N pajamas)))))\n", "(S\n", " (NP I)\n", " (VP\n", " (V shot)\n", " (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))\n" ] } ], "source": [ "sentence = 'I shot an elephant in my pajamas'.split(' ')\n", "parser = nltk.ChartParser(groucho_grammar)\n", "for tree in parser.parse(sentence):\n", " print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A graph of the tree structure can be found [here](https://github.com/marcotav/alphabet-human-thought/blob/master/images/tree_cfg.png). The two trees differ with respect to where the shooting occured:\n", "- In the first the elephant was inside the pajamas and the shot is a camera shot.\n", "- In the second the elephant was shot with a gunby a shooter using pajamas.\n", "\n", "### We see that identifying the structure of sentences, their meaning becomes more easily identifiable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Context-Free Grammar (CFG)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grammar1 = nltk.CFG.fromstring(\"\"\"\n", " S -> NP VP\n", " VP -> V NP | V NP PP\n", " PP -> P NP\n", " V -> \"saw\" | \"ate\" | \"walked\"\n", " NP -> \"John\" | \"Mary\" | \"Bob\" | Det N | Det N PP\n", " Det -> \"a\" | \"an\" | \"the\" | \"my\"\n", " N -> \"man\" | \"dog\" | \"cat\" | \"telescope\" | \"park\"\n", " P -> \"in\" | \"on\" | \"by\" | \"with\"\n", " \"\"\")\n", "\n", "grammar1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `RecursiveDescentParser` used below is\n", "> A simple top-down CFG parser that parses texts by recursively expanding the fringe of a Tree, \n", "and matching it against a text." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S (NP Mary) (VP (V saw) (NP Bob)))\n" ] } ], "source": [ "sentence = \"Mary saw Bob\".split()\n", "rd_parser = nltk.RecursiveDescentParser(grammar1)\n", "\n", "for tree in rd_parser.parse(sentence):\n", " print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Node $S$ splits into node $(NP \\,{\\text{Mary}})$ and node $VP$. Node $VP$ splits into $(V \\,{\\text{saw}})$ and $(NP \\,{\\text{Bob}})$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the sentence \"the dog saw a man in a park\" using the grammar above given origin to two trees because of its ambiguous structure, in this case, preposicional phrase attachment ambiguity." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S\n", " (NP (Det the) (N dog))\n", " (VP (V saw) (NP (Det a) (N man) (PP (P in) (NP (Det a) (N park))))))\n", "(S\n", " (NP (Det the) (N dog))\n", " (VP (V saw) (NP (Det a) (N man)) (PP (P in) (NP (Det a) (N park)))))\n" ] } ], "source": [ "sentence = \"the dog saw a man in a park\".split()\n", "rd_parser = nltk.RecursiveDescentParser(grammar1)\n", "\n", "for tree in rd_parser.parse(sentence):\n", " print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first tree, the seeing occurred in the park. In this case, NP is \"the dog\" and the seeing act refers to it.\n", "\n", " [the dog] [saw a man in the park]\n", "\n", "In the second, the man was in the park but the dog could be outside, looking at the park. In this case, the first of the right branch is \"a man in the park\". So the sentence is:\n", "\n", " [the dog saw][a man in a park]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursive Descent Parser\n", "\n", "A parser is an interpretation of the grammar and looks for the required sentence along the fringe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TBC " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }