{ "metadata": { "name": "", "signature": "sha256:ab5bd899974c2904d36726cc9b828cc17736191326442e39bcc6927d83e68764" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This file shows how you can use a corpus of utterances and information retrieval techniques to try to make a chatbot respond to a user in a relevant way.\n", "\n", "The main idea here is the [vector space model of documents and queries][1], along with the [term frequency - inverse document frequency (TFIDF) representation for modeling document topics][2] and [the cosine metric][3] for measuring document similarity. These techniques are old - the vector space model goes back to the '60s and TFIDF weighting was established in the '70s. There are many improvements to information retrieval models since then, including Google's [PageRank][4] model to weight documents based on their importance and a variety of improved statistical models of document topics. However, these techniques still provide an excellent starting point for accessing items from text collections that are relevant to a topic, and they're very easy to implement and use!\n", "\n", "This particular chatbot draws on a book of _Miscellaneous Aphorisms_ that Oscar Wilde published in 1911. Most of the work comes in preprocessing this collection to index its candidate utterances using the TFIDF model so we can easily find the utterance that's most similar to what the user has just said.\n", "\n", "[1]:http://en.wikipedia.org/wiki/Vector_space_model\n", "[2]:http://en.wikipedia.org/wiki/Tf%E2%80%93idf\n", "[3]:http://en.wikipedia.org/wiki/Cosine_similarity\n", "[4]:http://en.wikipedia.org/wiki/PageRank\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re, collections, math, random" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file of responses is formatted as a sequence of paragraphs, separated by blank lines. It's convenient to represent the utterances this way because many of them are quite long, and this way you can use the newlines in the file to have the utterances preformatted for the screen. This particular `paragraphs` detector uses Python's `generator` constructs to save its state as it successively returns chunks of data from the file. The code originates with Alex Martelli and Magnus Lie Hetland and the _Python Cookbook_." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def paragraphs(file, separator=None):\n", " if not callable(separator):\n", " if separator != None: raise TypeError, \"separator must be callable\"\n", " def separator(line): return line == '\\n'\n", " paragraph = []\n", " for line in file:\n", " if separator(line):\n", " if paragraph:\n", " yield ''.join(paragraph)\n", " paragraph = []\n", " else:\n", " paragraph.append(line)\n", " if paragraph :\n", " yield ''.join(paragraph)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the things the chatbot will be able to say as paragraphs from the specified file.\n", "`strip()` eliminates any final newlines." ] }, { "cell_type": "code", "collapsed": false, "input": [ "response_file = \"oscar-wilde-quotes.txt\"\n", "with open(response_file) as f :\n", " responses = [p.strip() for p in paragraphs(f)]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Information retrieval models of similarity focus on the important words. There are many words in English text that occur frequently and that don't carry a lot of information about the topic of the text. Such words are known as _stop words_ in information retrieval, and I've downloaded a common collection of stop words [from a kind of random looking website][1] for use here.\n", "\n", "[1]:http://xpo6.com/download-stop-word-list/\n", "\n", "We just create a set of all the stop words to exclude them from our vocabulary indexing later. In particular, we write a helper function `content_words` that takes a string, breaks it into the words that make it up in a simple way (we can ignore punctuation for indexing, since punctuation certainly doesn't say anything about the content of a document!), and makes a list of all the word tokens that are not in the stop word set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "stop_word_file = \"stop-word-list.txt\"\n", "with open(stop_word_file) as f :\n", " stop_words = set(line.strip() for line in f)\n", " \n", "def content_words(text) :\n", " return [w.lower() for w in re.findall(r\"\\w(?:'?\\w)*\", text)\n", " if w.lower() not in stop_words]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vector space models treat _training documents_ differently from _test documents_. They have to, because language has lots of rare words. You will almost always see words in your test documents (your queries) that you haven't seen anywhere in your corpus of possible responses. You have no choice but to ignore these new words. (Or do you?)\n", "\n", "The situation is different for the training documents: here you want to record each new word that you see so that if it comes up again (in training or test) you can associate the word with the documents where it occurs.\n", "\n", "So we now define two different procedures that will count up the words in a document, to make the famous [bag of words][1] representation. In the case of a training document, we keep all the content words that occur in the document, and add any new ones to a list of the vocabulary of our collection. In the case of a test document, we only keep the words that already occur in the vocabulary, discarding counts for new words.\n", "\n", "[1]:http://en.wikipedia.org/wiki/Bag-of-words_model" ] }, { "cell_type": "code", "collapsed": true, "input": [ "def count_train(text, vocabulary) :\n", " words = content_words(text)\n", " vocabulary.update(words)\n", " return collections.Counter(words)\n", "\n", "def count_test(text, vocabulary) :\n", " words = [w for w in content_words(text) if w in vocabulary]\n", " return collections.Counter(words)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now define the key quantities that the information retrieval model uses to weight words.\n", "- _Term frequency_ (TF) is a normalized measure of how often a word occurs in a document. It's the number of time a word occurs divided by the number of content word tokens in the document. \n", "- _Inverse document frequency_ (IDF) is the negative log of the fraction of documents that feature a particular term. Thus, terms that occur in more documents have a lower IDF score and so are less important, while terms that occur in fewer documents have a larger IDF score and are more important.\n", "\n", "Finally, because we're interested in _cosine measures of similarity_ which we explain more below, we _normalize_ the scores in the representation of each document, so each document is associated with a vector (in TFIDF space) with length 1." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def mk_idf(vocabulary, counts) :\n", " ndocs = float(len(counts))\n", " result = dict()\n", " for w in vocabulary :\n", " result[w] = math.log(ndocs / sum(1 if count[w] > 0 else 0 for count in counts))\n", " return result\n", "\n", "def mk_tf_idf(count, idf) :\n", " total = sum(count[w] for w in count)\n", " result = dict((w, idf[w] * count[w] / total) for w in count)\n", " length = math.sqrt(sum(result[w] * result[w] for w in result))\n", " for w in result :\n", " result[w] = result[w] / length\n", " return result" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code does all the work of setting up the vector space representation of our text collection. We accumulate all the words that occur in our corpus in `vocabulary` and accumulate in `counts` a bag of words dictionary for each of our texts. Then we can compile the inverse document frequencies for all our vocabulary items as `idf_dict` and finally create a normalized, weighted TFIDF vector representation for each document in `scores`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "vocabulary = set()\n", "counts = [count_train(utt, vocabulary) for utt in responses]\n", "idf_dict = mk_idf(vocabulary, counts)\n", "scores = [mk_tf_idf(count, idf_dict) for count in counts]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function defines the _cosine similarity measure_. In our model, each document is interpreted as a vector. Think of it as an arrow pointing from the origin of the space off into a particular direction (representing its topic). When two arrows point in close to the same direction, they are similar to each other. The distance between two arrows is the angle between them, and you can measure the degree to which the arrows are aligned by the cosine of this angle (a number that's 1 when the arrows are pointing exactly the same way, so the angle is 0, and a number that's 0 when the arrows are pointed perpendicularly to one another, so the angle is 90). It turns out that you can measure the cosine by the dot product, which is the sum of the product of corresponding components in the vectors. So this code is very simple but it has a very complicated intuition behind it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def similarity(d1, d2) :\n", " return sum(d1[w] * d2[w] for w in d1 if w in d2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The strategy for our chatbot will now be to respond by retrieval: to make an utterance that seems to be as related as possible to the utterance provided by the user. So we measure the similarity of all our response data to what the user just said, and sort the responses by similarity." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def sort_responses(tf_idf_utt, scores, utts) :\n", " options = [(utts[i], similarity(tf_idf_utt, scores[i]))\n", " for i in range(len(utts))]\n", " return sorted(options, key = lambda (u, s): s, reverse=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than simply saying the most relevant thing all the time, we will randomize our responses. We prefer the most relevant ones - more precisely, we say each response with a probability that's proportional to the relevance we have estimated." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def weighted_random_item(options) :\n", " total = sum(w for (u,w) in options)\n", " r = random.uniform(0,total)\n", " i = 0\n", " while r > 0 :\n", " u, w = options[i]\n", " if r < w or i == len(options) - 1 :\n", " return u\n", " else :\n", " r = r - w\n", " i = i + 1 " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the main response engine: take input text, process it to compute its vector representation, and then say something relevant in response." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def respond(text) :\n", " content = count_test(text, vocabulary)\n", " tf_idf_utt = mk_tf_idf(content, idf_dict)\n", " return(weighted_random_item(sort_responses(tf_idf_utt, scores, responses)))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some example responses. You get a sense of Oscar Wilde's personality and interests from this. Probably you could hope for a better collection of responses - maybe something with a mix of facts and opinions? And maybe there is a better way to measure continuity of topic - perhaps by prioritizing words at the beginning of documents? Or using a part of speech tagger to focus on particular categories of words? It's all worth exploring." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print respond(\"love is splendid.\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "The only way to behave to a woman, is to make love to her if she is\n", "pretty and to someone else if she is plain.\n" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "print respond(\"a fine romance with no kisses.\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "What are the virtues? Nature, Renan tells us, cares little about\n", "chastity, and it may be that it is to the shame of the Magdalen, and not\n", "to their own purity, that the Lucretias of modern life owe their freedom\n", "from stain. Charity, as even those of whose religion it makes a formal\n", "part have been compelled to acknowledge, creates a multitude of evils.\n", "The mere existence of conscience, that faculty of which people prate so\n", "much nowadays, and are so ignorantly proud, is a sign of our imperfect\n", "development. It must be merged in instinct before we become fine.\n", "Self-denial is simply a method by which man arrests his progress, and\n", "self-sacrifice a survival of the mutilation of the savage, part of that\n", "old worship of pain which is so terrible a factor in the history of the\n", "world, and which even now makes its victims day by day and has its\n", "altars in the land. Virtues! Who knows what the virtues are? Not you.\n", "Not I. Not anyone. It is well for our vanity that we slay the criminal,\n", "for if we suffered him to live he might show us what we had gained by\n", "his crime. It is well for his peace that the saint goes to his\n", "martyrdom. He is spared the sight of the horror of his harvest.\n" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "print respond(\"i'm not feeling so well today doctor. can you help me?\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "In this world there are only two tragedies. One is not getting what one\n", "wants, and the other is getting it. The last is much the worst -- the last\n", "is a real tragedy.\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "print respond(\"i will take that as a no.\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Men of the noblest possible moral character are extremely susceptible to\n", "the influence of the physical charms of others. Modern, no less than\n", "ancient, history supplies us with many most painful examples of what I\n", "refer to. If it were not so, indeed, history would be quite unreadable.\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "print respond(\"slots are my favorite animal.\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "One is tempted to define man as a rational animal who always loses his\n", "temper when he is called upon to act in accordance with the dictates of\n", "reason.\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A final reminder -- here too we can use our usual code to create a chatbot interaction by invoking this script as the argument to python in a terminal.\n", "```python\n", "if __name__ == '__main__':\n", " print(\"\"\"\n", "Therapist\n", "---------\n", "Talk to the program by typing in plain English, using normal upper-\n", "and lower-case letters and punctuation. Enter \"quit\" when done.'\"\"\")\n", " print('='*72)\n", " print(\"Hello. What have you been thinking about lately?\")\n", " s = \"\"\n", " while s != \"quit\":\n", " s = input(\">\")\n", " while s and s[-1] in \"!.\":\n", " s = s[:-1]\n", " \n", " print(respond(s))\n", "```\n" ] } ], "metadata": {} } ] }