{ "metadata": { "name": "", "signature": "sha256:10f40dbd7909c2854d2fec956b9be656d79edfa721af91791a15dc1ba83eae0e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Version of the Eliza Program with a part-of-speech tagging preprocessor to transform input text with more intelligence." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import random\n", "import re" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This file takes advantage of Python's `pickle` feature, which allows you to dump an object that you've created to a binary file and then load it in later.\n", "\n", "In this case, we're loading a [part of speech tagger][1], which is a program that takes a list of words (and other tokens from the input sentence) and analyzes them to disambiguate the basic grammatical role that they play in the sentence.\n", "\n", "I built this part of speech tagger using [NLTK][2], which is freely available. NLTK comes with several taggers built in, but none does the kind of disambiguation you'd want for Eliza, which is why I built it myself.\n", "\n", "I trained the tagger on [the Brown Corpus][3], which has been marked up with relatively rich part of speech tags. You can see a list of those tags [here][4]. Two things are particularly useful about this tag set. \n", "- First, it distinguishes between subject pronouns and object pronouns (using the tags `PPSS` for a subject pronoun and `PPO` for an object pronoun). This means that you can look at how the word _you_ is tagged to decide whether to swap it with _I_ or _me_. \n", "- Second, it indicates whether nouns are common (tags including the symbols `NN`) or proper (tags including the symbols `NP`). That means you can tell whether a word that's capitalized at the beginning of a sentence deserves to stay capitalized when you repeat it in the middle of a sentence.\n", "\n", "Here is the code that I used to build the tagger and save it to a pickle.\n", "```python\n", "import nltk\n", "from pickle import dump\n", "unigram_tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())\n", "brill_tagger_trainer = nltk.tag.brill_trainer.BrillTaggerTrainer(unigram_tagger, \n", " nltk.tag.brill.brill24())\n", "tagger = brill_tagger_trainer.train(nltk.corpus.brown.tagged_sents(), max_rules=1000)\n", "outfile = open(\"bbt.pkl\", \"wb\")\n", "dump(tagger, outfile, -1)\n", "outfile.close()\n", "```\n", "As you can see, all the work to get this going is pretty much done by NLTK. But you should know a little bit about what's going on under the hood. We're using an algorithm called [Brill Tagging][5]. The NLTK book also talks about this in [Section 6 of Chapter 5][6]. This algorithm starts from a simple tagger -- here we're using the `UnigramTagger` that just tags each symbol with its most likely reading, ignoring the surrounding context. Then we use the `BrillTaggerTraininer` to find a collection of rules that correct mistakes in the default tagging. These rules can use a variety of features including what words are nearby and what parts of speech are nearby. We use our training data to search through the possible rules and find the ones that work the best at correcting the errors of the unigram tagger. This took several hours to run (I'm not sure how long because I didn't stick around to wait). So it was important to store the result as a pickle for whenever we need it later. 
This also has the advantage that the tagger we store no longer depends on NLTK so it's easier to use. \n", "\n", "[1]:http://en.wikipedia.org/wiki/Part-of-speech_tagging\n", "[2]:http://www.nltk.org/\n", "[3]:http://en.wikipedia.org/wiki/Brown_Corpus\n", "[4]:http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html\n", "[5]:http://en.wikipedia.org/wiki/Brill_tagger\n", "[6]:http://www.nltk.org/book/ch05.html\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pickle import load\n", "infile = open(\"bbt.pkl\", \"rb\")\n", "tagger = load(infile)\n", "infile.close()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we load in the pickle as `tagger`, we're in business. We convert the user's utterance into a list of tokens `L` and call `tagger.tag(L)`. What we get back is a list of pairs, where the first item is the word and the second item is the part of speech that the tagger inferred for the word. \n", "\n", "Here's quickie code to split a string up into suitable tokens. It's a little annoying because of the strange ways we write punctuation in English, and because of the funny way that Python's `re.split` winds up returning lots of copies of the empty string when its capturing groups match nothing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def tokenize(text) :\n", " return [tok for tok in re.split(r\"\"\"([,.;:?\"]?) # optionally, common punctuation \n", " (['\"]*) # optionally, closing quotation marks\n", " (?:\\s|\\A|\\Z) # necessarily: space or start or end\n", " ([\"`']*) # snarf opening quotes on the next word\n", " \"\"\", \n", " text, \n", " flags=re.VERBOSE)\n", " if tok != '']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we restrict our translation just to the unambiguous cases:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "untagged_reflection_of = {\n", " \"am\" : \"are\",\n", " \"i\" : \"you\",\n", " \"i'd\" : \"you would\",\n", " \"i've\" : \"you have\",\n", " \"i'll\" : \"you will\",\n", " \"i'm\" : \"you are\",\n", " \"my\" : \"your\",\n", " \"me\" : \"you\",\n", " \"you've\": \"I have\",\n", " \"you'll\": \"I will\",\n", " \"you're\": \"I am\",\n", " \"your\" : \"my\",\n", " \"yours\" : \"mine\"}\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We handle the ambiguous cases in a couple of different ways. First, there are some tokens that we can now map conditionally just depending on the way they were tagged." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tagged_reflection_of = {\n", " (\"you\", \"PPSS\") : \"I\",\n", " (\"you\", \"PPO\") : \"me\"\n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the code to translate individual tokens. It also handles capitalization: a word keeps its original form only when its tag contains the proper-noun marker `NP`; everything else is lowercased."
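, "\n", "\n", "For example, a tagged utterance comes back roughly like this (the tags shown here are only illustrative -- the exact output depends on the trained tagger):\n", "```python\n", "tagger.tag(tokenize(\"You said I was crazy.\"))\n", "# e.g. [('You', 'PPSS'), ('said', 'VBD'), ('I', 'PPSS'), ('was', 'BEDZ'), ('crazy', 'JJ'), ('.', '.')]\n", "```\n", "Each of these pairs is translated one at a time by the function below."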
] }, { "cell_type": "code", "collapsed": false, "input": [ "def translate_token(tagged_word) :\n", "    word, tag = tagged_word\n", "    wl = word.lower()\n", "    if (wl, tag) in tagged_reflection_of :\n", "        return (tagged_reflection_of[wl, tag], tag)\n", "    if wl in untagged_reflection_of :\n", "        return (untagged_reflection_of[wl], tag)\n", "    # only proper names (NP tags) keep their capitalization; everything else is lowercased\n", "    if tag.find(\"NP\") < 0 :\n", "        return (wl, tag)\n", "    return (word, tag)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, handling verbs like _are_ and _were_ is more complicated, because the tagger does not indicate whether they have a second person subject like _you_ or a third person plural subject like _they_. However, we're lucky: in English the subject is usually pretty close to the verb, and words like _you_ can't be modified in ways that put extra material in between. So we can just look for the noun phrase that's nearest the verb and use that to infer whether the subject of the verb is one of the pronouns we want to target." ] }, { "cell_type": "code", "collapsed": false, "input": [ "subject_tags = [\"PPS\",  # he, she, it\n", "                \"PPSS\", # you, we, they\n", "                \"PN\",   # everyone, someone\n", "                \"NN\",   # dog, cat\n", "                \"NNS\",  # dogs, cats\n", "                \"NP\",   # Fred, Jane\n", "                \"NPS\"   # Republicans, Democrats\n", "                ]\n", "\n", "def swap_ambiguous_verb(tagged_words, tagged_verb_form, target_subject_pronoun, replacement) :\n", "    for i, (w, t) in enumerate(tagged_words) :\n", "        if (w, t) == tagged_verb_form :\n", "            j = i - 1\n", "            # look earlier for the subject\n", "            while j >= 0 and tagged_words[j][1] not in subject_tags :\n", "                j = j - 1\n", "            # if subject is the target, swap verb forms\n", "            if j >= 0 and tagged_words[j][0].lower() == target_subject_pronoun :\n", "                tagged_words[i] = replacement\n", "            # didn't find a subject before the verb, so probably a question\n", "            if j < 0 :\n", "                j = i + 1\n", "                while j < len(tagged_words) and tagged_words[j][1] not in subject_tags :\n", "                    j = j + 1\n", "                # if subject is the target, swap verb forms\n", "                if j < len(tagged_words) and tagged_words[j][0].lower() == target_subject_pronoun :\n", "                    tagged_words[i] = replacement" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are four cases that we need to deal with: fixing \"are\", \"am\", \"were\", and \"was\" when swapping their subjects would otherwise break English agreement. We also have to fix some punctuation." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def handle_specials(tagged_words) :\n", "    # don't keep punctuation at the end\n", "    while tagged_words and tagged_words[-1][1] == '.' :\n", "        tagged_words.pop()\n", "    # replace forms of \"be\" to agree with swapped subjects\n", "    swap_ambiguous_verb(tagged_words, (\"are\", \"BER\"), \"i\", (\"am\", \"BEM\"))\n", "    swap_ambiguous_verb(tagged_words, (\"am\", \"BEM\"), \"you\", (\"are\", \"BER\"))\n", "    swap_ambiguous_verb(tagged_words, (\"were\", \"BED\"), \"i\", (\"was\", \"BEDZ\"))\n", "    swap_ambiguous_verb(tagged_words, (\"was\", \"BEDZ\"), \"you\", (\"were\", \"BED\"))\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we put it all together. We split the sentence into tokens, tag the tokens, translate using the tags, deal with the verbs, and then put things back together.
\n", "\n", "Fortunately, the tagger can alert us to the presence of punctuation of various types, so we know where the spaces belong in the output!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "close_punc = ['.', ',', \"''\"]\n", "def translate(this):\n", " tokens = tokenize(this)\n", " tagged_tokens = tagger.tag(tokens)\n", " translation = [translate_token(tt) for tt in tagged_tokens]\n", " handle_specials(translation)\n", " if len(translation) > 0 :\n", " with_spaces = [translation[0][0]]\n", " for i in range(1, len(translation)) :\n", " if translation[i-1][1] != '``' and translation[i][1] not in close_punc :\n", " with_spaces.append(' ')\n", " with_spaces.append(translation[i][0]) \n", " return ''.join(with_spaces)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regular expressions are exactly the same as before. So is the code to generate a response!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "rules = [(re.compile(x[0]), x[1]) for x in [\n", " ['How are you?',\n", " [ \"I'm fine, thank you.\"]],\n", " [\"I need (.*)\",\n", " [ \"Why do you need %1?\",\n", " \"Would it really help you to get %1?\",\n", " \"Are you sure you need %1?\"]],\n", " [\"Why don't you (.*)\",\n", " [ \"Do you really think I don't %1?\",\n", " \"Perhaps eventually I will %1.\",\n", " \"Do you really want me to %1?\"]],\n", " [\"Why can't I (.*)\",\n", " [ \"Do you think you should be able to %1?\",\n", " \"If you could %1, what would you do?\",\n", " \"I don't know -- why can't you %1?\",\n", " \"Have you really tried?\"]],\n", " [\"I can't (.*)\",\n", " [ \"How do you know you can't %1?\",\n", " \"Perhaps you could %1 if you tried.\",\n", " \"What would it take for you to %1?\"]],\n", " [\"I am (.*)\",\n", " [ \"Did you come to me because you are %1?\",\n", " \"How long have you been %1?\",\n", " \"How do you feel about being %1?\"]],\n", " [\"I'm (.*)\",\n", " [ \"How does being %1 make you feel?\",\n", " \"Do you enjoy being %1?\",\n", " \"Why do you tell me you're %1?\",\n", " \"Why do you think you're %1?\"]],\n", " [\"Are you (.*)\",\n", " [ \"Why does it matter whether I am %1?\",\n", " \"Would you prefer it if I were not %1?\",\n", " \"Perhaps you believe I am %1.\",\n", " \"I may be %1 -- what do you think?\"]],\n", " [\"What (.*)\",\n", " [ \"Why do you ask?\",\n", " \"How would an answer to that help you?\",\n", " \"What do you think?\"]],\n", " [\"How (.*)\",\n", " [ \"How do you suppose?\",\n", " \"Perhaps you can answer your own question.\",\n", " \"What is it you're really asking?\"]],\n", " [\"Because (.*)\",\n", " [ \"Is that the real reason?\",\n", " \"What other reasons come to mind?\",\n", " \"Does that reason apply to anything else?\",\n", " \"If %1, what else must be true?\"]],\n", " [\"(.*) sorry (.*)\",\n", " [ \"There are many times when no apology is needed.\",\n", " \"What feelings do you have when you apologize?\"]],\n", " [\"Hello(.*)\",\n", " [ \"Hello... I'm glad you could drop by today.\",\n", " \"Hi there... 
how are you today?\",\n", " \"Hello, how are you feeling today?\"]],\n", " [\"I think (.*)\",\n", " [ \"Do you doubt %1?\",\n", " \"Do you really think so?\",\n", " \"But you're not sure %1?\"]],\n", " [\"(.*) friend(.*)\",\n", " [ \"Tell me more about your friends.\",\n", " \"When you think of a friend, what comes to mind?\",\n", " \"Why don't you tell me about a childhood friend?\"]],\n", " [\"Yes\",\n", " [ \"You seem quite sure.\",\n", " \"OK, but can you elaborate a bit?\"]],\n", " [\"No\",\n", " [ \"Why not?\"]],\n", " [\"(.*) computer(.*)\",\n", " [ \"Are you really talking about me?\",\n", " \"Does it seem strange to talk to a computer?\",\n", " \"How do computers make you feel?\",\n", " \"Do you feel threatened by computers?\"]],\n", " [\"Is it (.*)\",\n", " [ \"Do you think it is %1?\",\n", " \"Perhaps it's %1 -- what do you think?\",\n", " \"If it were %1, what would you do?\",\n", " \"It could well be that %1.\"]],\n", " [\"It is (.*)\",\n", " [ \"You seem very certain.\",\n", " \"If I told you that it probably isn't %1, what would you feel?\"]],\n", " [\"Can you (.*)\",\n", " [ \"What makes you think I can't %1?\",\n", " \"If I could %1, then what?\",\n", " \"Why do you ask if I can %1?\"]],\n", " [\"Can I (.*)\",\n", " [ \"Perhaps you don't want to %1.\",\n", " \"Do you want to be able to %1?\",\n", " \"If you could %1, would you?\"]],\n", " [\"You are (.*)\",\n", " [ \"Why do you think I am %1?\",\n", " \"Does it please you to think that I'm %1?\",\n", " \"Perhaps you would like me to be %1.\",\n", " \"Perhaps you're really talking about yourself?\"]],\n", " [\"You're (.*)\",\n", " [ \"Why do you say I am %1?\",\n", " \"Why do you think I am %1?\",\n", " \"Are we talking about you, or me?\"]],\n", " [\"I don't (.*)\",\n", " [ \"Don't you really %1?\",\n", " \"Why don't you %1?\",\n", " \"Do you want to %1?\"]],\n", " [\"I feel (.*)\",\n", " [ \"Good, tell me more about these feelings.\",\n", " \"Do you often feel %1?\",\n", " \"When do you usually feel %1?\",\n", " \"When you feel %1, what do you do?\"]],\n", " [\"I have (.*)\",\n", " [ \"Why do you tell me that you've %1?\",\n", " \"Have you really %1?\",\n", " \"Now that you have %1, what will you do next?\"]],\n", " [\"I would (.*)\",\n", " [ \"Could you explain why you would %1?\",\n", " \"Why would you %1?\",\n", " \"Who else knows that you would %1?\"]],\n", " [\"Is there (.*)\",\n", " [ \"Do you think there is %1?\",\n", " \"It's likely that there is %1.\",\n", " \"Would you like there to be %1?\"]],\n", " [\"My (.*)\",\n", " [ \"I see, your %1.\",\n", " \"Why do you say that your %1?\",\n", " \"When your %1, how do you feel?\"]],\n", " [\"You (.*)\",\n", " [ \"We should be discussing you, not me.\",\n", " \"Why do you say that about me?\",\n", " \"Why do you care whether I %1?\"]],\n", " [\"Why (.*)\",\n", " [ \"Why don't you tell me the reason why %1?\",\n", " \"Why do you think %1?\" ]],\n", " [\"I want (.*)\",\n", " [ \"What would it mean to you if you got %1?\",\n", " \"Why do you want %1?\",\n", " \"What would you do if you got %1?\",\n", " \"If you got %1, then what would you do?\"]],\n", " [\"(.*) mother(.*)\",\n", " [ \"Tell me more about your mother.\",\n", " \"What was your relationship with your mother like?\",\n", " \"How do you feel about your mother?\",\n", " \"How does this relate to your feelings today?\",\n", " \"Good family relations are important.\"]],\n", " [\"(.*) father(.*)\",\n", " [ \"Tell me more about your father.\",\n", " \"How did your father make you feel?\",\n", " \"How do you feel about 
your father?\",\n", " \"Does your relationship with your father relate to your feelings today?\",\n", " \"Do you have trouble showing affection with your family?\"]],\n", " [\"(.*) child(.*)\",\n", " [ \"Did you have close friends as a child?\",\n", " \"What is your favorite childhood memory?\",\n", " \"Do you remember any dreams or nightmares from childhood?\",\n", " \"Did the other children sometimes tease you?\",\n", " \"How do you think your childhood experiences relate to your feelings today?\"]],\n", " [\"(.*)\\?\",\n", " [ \"Why do you ask that?\",\n", " \"Please consider whether you can answer your own question.\",\n", " \"Perhaps the answer lies within yourself?\",\n", " \"Why don't you tell me?\"]],\n", " [\"quit\",\n", " [ \"Thank you for talking with me.\",\n", " \"Good-bye.\",\n", " \"Thank you, that will be $150. Have a good day!\"]],\n", " [\"(.*)\",\n", " [ \"Please tell me more.\",\n", " \"Let's change focus a bit... Tell me about your family.\",\n", " \"Can you elaborate on that?\",\n", " \"Why do you say that %1?\",\n", " \"I see.\",\n", " \"Very interesting.\",\n", " \"So %1.\",\n", " \"I see. And what does that tell you?\",\n", " \"How does that make you feel?\",\n", " \"How do you feel when you say that?\"]]\n", "]]\n", "\n", "def respond(sentence):\n", " # find a match among keys, last one is quaranteed to match.\n", " for rule, value in rules:\n", " match = rule.search(sentence)\n", " if match is not None:\n", " # found a match ... stuff with corresponding value\n", " # chosen randomly from among the available options\n", " resp = random.choice(value)\n", " # we've got a response... stuff in reflected text where indicated\n", " while '%' in resp:\n", " pos = resp.find('%')\n", " num = int(resp[pos+1:pos+2])\n", " resp = resp.replace(resp[pos:pos+2], translate(match.group(num)))\n", " return resp\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here's the code to add to make the program interactive again.\n", "```python\n", "if __name__ == '__main__':\n", " print(\"\"\"\n", "Therapist\n", "---------\n", "Talk to the program by typing in plain English, using normal upper-\n", "and lower-case letters and punctuation. Enter \"quit\" when done.'\"\"\")\n", " print('='*72)\n", " print(\"Hello. How are you feeling today?\")\n", " s = \"\"\n", " while s != \"quit\":\n", " s = input(\">\")\n", " while s and s[-1] in \"!.\":\n", " s = s[:-1]\n", " \n", " print(respond(s))\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The responses below illustrate some of the improvements you can get in Eliza's responses using the new translation method." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"My mother hates me.\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "'When your mother hates you, how do you feel?'" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"I'm ``possibly,'' maybe crazy.\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "\"Do you enjoy being ``possibly,'' maybe crazy?\"" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"My dog was crazy.\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "'Why do you say that your dog was crazy?'" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"I was crazy\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 20, "text": [ "'So you were crazy.'" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"You said Fred was crazy.\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "'Why do you care whether I said Fred was crazy?'" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "respond(\"I asked you.\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 35, "text": [ "'Why do you say that you asked me?'" ] } ], "prompt_number": 35 } ], "metadata": {} } ] }