{ "metadata": { "gist_id": "7c54600b9c2af68914b3", "name": "", "signature": "sha256:47c50ff502afb39876f531313aa7c66924c6e9a01b725345c7f390300fccf3b2" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Clustering Related Posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks. Now we can vectorize our data.\n", "\n", "Let's vectorize a new post, then see how similar it is to our existing corpus." ] }, { "cell_type": "code", "collapsed": false, "input": [ "new_post = 'Opening beer bottles and cans 101'\n", "new_post_vect = vect.transform([new_post])\n", "\n", "print(new_post_vect.toarray())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "import scipy as sp\n", "import scipy.linalg\n", "\n", "def dists(v1, v2):\n", "    delta = v1-v2\n", "    # Calculate Euclidean \"norm\" distance\n", "    return sp.linalg.norm(delta.toarray())\n", "\n", "def similarity(new_post_vector, corpus):\n", "    best_dist = float('inf')\n", "    best_i = None\n", "\n", "    # One row of the sparse corpus matrix per post\n", "    for i in xrange(corpus.shape[0]):\n", "        post = posts[i]\n", "\n", "        if post == new_post:\n", "            continue\n", "        post_vec = corpus.getrow(i)\n", "        d = dists(post_vec, new_post_vector)\n", "        print 'Post %i with dist = %.2f: %s'%(i, d, post)\n", "\n", "        if d < best_dist:\n", "            best_dist = d\n", "            best_i = i\n", "    print 'Best post is {} with dist = {}'.format(best_i, best_dist)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "similarity(new_post_vect, X_train)" ], 
"language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Post 0 with dist = 3.00: how to open a beer without a bottle opener\n", "Post 1 with dist = 2.45: Do girls like beer bottles or beer cans?\n", "Post 2 with dist = 2.83: where did all my beer go?\n", "Post 3 with dist = 4.90: where did all my beer go? where did all my beer go?\n", "Post 4 with dist = 1.00: recycling beer bottles and cans\n", "Post 5 with dist = 2.83: Is it worth recycling?\n", "Post 6 with dist = 3.32: do not bring bottles to my backyard party, only cans please.\n", "Post 7 with dist = 2.65: This is useless\n", "Best post is 4 with dist = 1.0\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, our first text similarity measurement! We can see here that post 4 is most similar to our new post. However, `post 2` is \"closer\" than `post 3`, even though `post 3` is simply `post 2` doubled. Clearly, raw word counts are too simple a measure. The next step is to normalize the word counts to unit-length vectors to avoid this problem." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Update our dists function\n", "def dists(v1, v2):\n", "    v1_norm = v1/sp.linalg.norm(v1.toarray())\n", "    v2_norm = v2/sp.linalg.norm(v2.toarray())\n", "    delta = v1_norm-v2_norm\n", "    # Calculate Euclidean \"norm\" distance\n", "    return sp.linalg.norm(delta.toarray())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "similarity(new_post_vect, X_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Post 0 with dist = 1.27: how to open a beer without a bottle opener\n", "Post 1 with dist = 0.86: Do girls like beer bottles or beer cans?\n", "Post 2 with dist = 1.26: where did all my beer go?\n", "Post 3 with dist = 1.26: where did all my beer go? where did all my beer go?\n", "Post 4 with dist = 0.46: recycling beer bottles and cans\n", "Post 5 with dist = 1.41: Is it worth recycling?\n", "Post 6 with dist = 1.18: do not bring bottles to my backyard party, only cans please.\n", "Post 7 with dist = 1.41: This is useless\n", "Best post is 4 with dist = 0.459505841095\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, posts 2 & 3 are now equally similar to our new post." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Removing Less Important Words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many words in a language that do not carry much meaning for the overall interpretation of a message. Words like \"it\" should be much less meaningful than \"beer\" in our current context. These less important words are called `stop words`, and can be removed from the posts since they do not help us distinguish between different posts." 
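, "\n", "\n", "As a rough illustration of the idea (the stop list below is a tiny hypothetical one, not scikit-learn's built-in English list, and `content_tokens` is a made-up helper), dropping stop words before counting leaves only the content-bearing tokens:\n", "\n",
"```python\n",
"# Tiny hypothetical stop list, for illustration only\n",
"stop_words = {'is', 'it', 'to', 'a', 'my', 'all', 'where', 'this'}\n",
"\n",
"def content_tokens(text):\n",
"    # Lowercase, split on whitespace, strip simple punctuation, drop stop words\n",
"    tokens = (t.strip('?.,!') for t in text.lower().split())\n",
"    return [t for t in tokens if t and t not in stop_words]\n",
"\n",
"kept = content_tokens('Is it worth recycling?')  # only 'worth' and 'recycling' survive\n",
"```\n",
"\n", "A real vectorizer also handles tokenization details that this whitespace split ignores."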
] }, { "cell_type": "code", "collapsed": false, "input": [ "# Add English stop words to our vectorizer object.\n", "vect = CountVectorizer(min_df=1, stop_words='english')\n", "# Display a sample\n", "print sorted(vect.get_stop_words())[80:-150]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name']\n" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have a list of words in mind that you wish to `stop`, you can simply pass them as a list to the `stop_words` argument." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Stemming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to consider that similar words, such as \"girl\" and \"girls\", should probably be treated as the same word. Thus we need a function that reduces words to a common 'word stem'. We can do this with the **Natural Language Toolkit (NLTK)**. After installing NLTK, import the library and try out the stemmer for English." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk.stem\n", "\n", "s = nltk.stem.SnowballStemmer('english')\n", "\n", "print s.stem('bottles')\n", "print s.stem('bottle')\n", "\n", "print s.stem('perception')\n", "print s.stem('perceptive')\n", "\n", "print s.stem('crashing')\n", "print s.stem('crashed')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "bottl\n", "bottl\n", "percept\n", "percept\n", "crash\n", "crash\n" ] } ], "prompt_number": 13 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extending the vectorizer with NLTK stemming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to stem the posts before we feed them into the `CountVectorizer`. The best way to do this is to override the method `build_analyzer`. \n", "\n", "By doing this we reuse the preprocessing in the parent class, which converts the raw posts to lower case and tokenizes them. We then convert each token into its stemmed version." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk.stem\n", "\n", "english_stemmer = nltk.stem.SnowballStemmer('english')\n", "\n", "class StemmedCountVectorizer(CountVectorizer):\n", "    def build_analyzer(self):\n", "        analyzer = super(StemmedCountVectorizer, self).build_analyzer()\n", "        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))\n", "\n", "vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "X = vectorizer.fit_transform(posts)\n", "vectorizer.get_feature_names()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "[u'backyard',\n", " u'beer',\n", " u'bottl',\n", " u'bring',\n", " u'can',\n", " u'did',\n", " u'girl',\n", " u'like',\n", " u'open',\n", " u'parti',\n", " u'recycl',\n", " u'useless',\n", " u'worth']" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "# Restate the new vectorizer on the data\n", "X_train = vectorizer.fit_transform(posts)\n", "new_post_vect = vectorizer.transform([new_post])\n", "\n", "similarity(new_post_vect, X_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Post 0 with dist = 0.61: how to open a beer without a bottle opener\n", "Post 1 with dist = 0.77: Do girls like beer bottles or beer cans?\n", "Post 2 with dist = 1.14: where did all my beer go?\n", "Post 3 with dist = 1.14: where did all my beer go? 
where did all my beer go?\n", "Post 4 with dist = 0.71: recycling beer bottles and cans\n", "Post 5 with dist = 1.41: Is it worth recycling?\n", "Post 6 with dist = 1.05: do not bring bottles to my backyard party, only cans please.\n", "Post 7 with dist = 1.41: This is useless\n", "Best post is 0 with dist = 0.605810893055\n" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see now that post 0 is most similar to our new post, because \"bottles\" and \"bottle\" (and likewise \"opening\", \"open\", and \"opener\") are now reduced to the same stem." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print new_post\n", "print posts[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Opening beer bottles and cans 101\n", "how to open a beer without a bottle opener\n" ] } ], "prompt_number": 17 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Thinking a bit deeper about relevant post features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have assumed that a higher occurrence of certain words in a post equates to a greater importance of those words in the post. While this is true to some extent, some very frequent words carry no real meaning for the posts. For example, the word \"Subject\" appears in every post, so it communicates nothing important and does not help us distinguish between posts.\n", "\n", "We could perhaps set a 90% occurrence cutoff in our vectorizer (via `max_df`), such that words that occur in >90% of the posts are excluded. However, we still run into the problem of border cases, say where a word occurs in only 89% of the posts.\n", "\n", "To solve these problems we count the term frequencies for every post **while** discounting those words that appear in many posts. This is the concept of **term frequency - inverse document frequency (TF-IDF)**. We can implement this using scikit-learn's `TfidfVectorizer`." 
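, "\n", "\n", "To make the idea concrete, here is a minimal hand-rolled sketch of the textbook formula (term frequency times the log of the inverse document frequency). The `tfidf` helper and the toy corpus `D` are illustrative assumptions, not part of this notebook's pipeline. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ.\n", "\n",
"```python\n",
"import math\n",
"\n",
"def tfidf(term, doc, corpus):\n",
"    # Term frequency: share of the document's tokens that are `term`\n",
"    tf = doc.count(term) / float(len(doc))\n",
"    # Number of documents containing the term (assumed to be at least one)\n",
"    n_docs_with_term = sum(1 for d in corpus if term in d)\n",
"    # Inverse document frequency: rarer terms get a larger weight\n",
"    idf = math.log(len(corpus) / float(n_docs_with_term))\n",
"    return tf * idf\n",
"\n",
"# Toy corpus of three already-tokenized documents\n",
"a, abb, abc = ['a'], ['a', 'b', 'b'], ['a', 'b', 'c']\n",
"D = [a, abb, abc]\n",
"\n",
"score_a = tfidf('a', a, D)    # 'a' is in every document, so it scores 0\n",
"score_b = tfidf('b', abb, D)  # frequent in abb, present in 2 of 3 documents\n",
"score_c = tfidf('c', abc, D)  # appears once, but in only 1 of 3 documents\n",
"```\n",
"\n", "A term that occurs in every document gets a weight of zero, which is exactly the discounting of uninformative words described above."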
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "# Subclass TfidfVectorizer to include our stemmer\n", "\n", "class StemmedTfidfVectorizer(TfidfVectorizer):\n", "    def build_analyzer(self):\n", "        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()\n", "\n", "        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))\n", "\n", "vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now instead of counts, our document vectors will contain individual TF-IDF values per term (token)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Restate new vectorizer\n", "X_train = vectorizer.fit_transform(posts)\n", "new_post_vect = vectorizer.transform([new_post])\n", "\n", "similarity(new_post_vect, X_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Post 0 with dist = 0.57: how to open a beer without a bottle opener\n", "Post 1 with dist = 0.99: Do girls like beer bottles or beer cans?\n", "Post 2 with dist = 1.26: where did all my beer go?\n", "Post 3 with dist = 1.26: where did all my beer go? where did all my beer go?\n", "Post 4 with dist = 0.90: recycling beer bottles and cans\n", "Post 5 with dist = 1.41: Is it worth recycling?\n", "Post 6 with dist = 1.17: do not bring bottles to my backyard party, only cans please.\n", "Post 7 with dist = 1.41: This is useless\n", "Best post is 0 with dist = 0.572957858071\n" ] } ], "prompt_number": 19 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Recap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have:\n", "\n", "1. Tokenized the text\n", "2. Discarded words that occur too often and don't help us detect relevant posts\n", "3. 
Thrown away very uncommon words\n", "4. Counted the remaining words\n", "5. Calculated TF-IDF values from the counts, considering the whole text corpus.\n", "\n", "**Limitations of the bag-of-words approach**\n", "\n", "* It does not cover word relations: \"Car hits wall\" and \"Wall hits car\" will both have the same feature vector.\n", "* It does not capture negations well: \"I will eat soup\" and \"I will *not* eat soup\" will have very similar feature vectors. This can be remedied by also counting bigrams and trigrams (two or three words in a row together).\n", "* It fails completely with misspelled words." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we can represent our blog posts quantitatively, to some degree, our goal is to cluster similar posts. There are two main types of clustering algorithms: **flat** and **hierarchical**. \n", "\n", "**Flat clustering** divides the posts into a set of clusters that minimizes the difference _within_ clusters and maximizes the difference _between_ clusters. Generally we have to specify the number of clusters upfront.\n", "\n", "**Hierarchical clustering** does not require the number of clusters as an input. It creates a hierarchy of clusters where very similar posts are grouped together, and similar clusters are then grouped recursively until one cluster is left that contains all the data. Once completed, the user can discern the optimal number of clusters." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "KMeans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "KMeans is probably the most common **flat** clustering algorithm. First you must specify the number of desired clusters (k). From there, the algorithm picks k random _seeds_ within the data. Then it assigns each post to the closest seed centroid. 
Next, each seed is relocated to the mean of the points assigned to it. Then the process is repeated: the posts are reassigned based on the new closest seed point. This continues as long as the seed centroids move a considerable amount. After some _n_ iterations, the movements fall below a threshold and the algorithm is considered converged." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Get some test data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will utilize a machine learning dataset that contains 18,828 posts from 20 different newsgroups. There are many topics including technology, politics, and religion. However, for now we will only use the technical groups.\n", "\n", "One question we could ask is, for a certain topic, can we effectively cluster the newsgroups that published on that topic into distinct categories?\n", "\n", "\n", "This data is already split into testing and training data, and we can download the data using sklearn." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import sklearn.datasets\n", "\n", "save_dir = '/users/ryankelly/downloads/' # Your save file path\n", "\n", "# Download data using sklearn\n", "df = sklearn.datasets.load_mlcomp(\"20news-18828\", mlcomp_root=save_dir)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "# Data files\n", "print df.filenames\n", "print len(df.filenames)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['/users/ryankelly/downloads/379/raw/comp.graphics/1190-38614'\n", " '/users/ryankelly/downloads/379/raw/comp.graphics/1383-38616'\n", " '/users/ryankelly/downloads/379/raw/alt.atheism/487-53344' ...,\n", " '/users/ryankelly/downloads/379/raw/rec.sport.hockey/10215-54303'\n", " '/users/ryankelly/downloads/379/raw/sci.crypt/10799-15660'\n", " '/users/ryankelly/downloads/379/raw/comp.os.ms-windows.misc/2732-10871']\n", "18828\n" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "# Data Topics\n", "df.target_names" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "# Restrict data to only 'tech' categories\n", "group = ['comp.graphics', 'comp.os.ms-windows.misc', \n", " 
'comp.sys.ibm.pc.hardware', 'comp.sys.ma c.hardware', \n", " 'comp.windows.x', 'sci.space']\n", "# Reload in only training data with the desired categories\n", "train_data = sklearn.datasets.load_mlcomp('20news-18828', 'train',\n", "                                          mlcomp_root=save_dir,\n", "                                          categories=group)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "print(len(train_data.filenames))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3414\n" ] } ], "prompt_number": 25 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Clustering posts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While initializing our `vectorizer` we have to remember that we are working with real data, which contains many errors, in this case invalid characters that cannot be decoded." ] }, { "cell_type": "code", "collapsed": false, "input": [ "vec = StemmedTfidfVectorizer(min_df=10, max_df=0.5,\n", "                             stop_words='english', decode_error='ignore')\n", "\n", "vecData = vec.fit_transform(train_data.data)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "num_samples, num_features = vecData.shape\n", "print('#samples: {}, #features: {}'.format(num_samples, num_features))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "#samples: 3414, #features: 4331\n" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the information we will use as input for KMeans clustering. Since we know there are 5 topic groups in these data, it makes sense that there could be 5 clusters in the data, so we will try this first." 
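, "\n", "\n", "Before handing the work to scikit-learn, the KMeans loop described earlier can be sketched in a few lines of NumPy. This is only a rough sketch on small dense arrays, not scikit-learn's implementation: `kmeans_sketch`, its parameters, and the toy points `pts` are all hypothetical names chosen for illustration.\n", "\n",
"```python\n",
"import numpy as np\n",
"\n",
"def kmeans_sketch(points, k, n_iter=100, tol=1e-4, seed=0):\n",
"    rng = np.random.RandomState(seed)\n",
"    # Pick k distinct data points as the initial seeds\n",
"    centers = points[rng.choice(len(points), size=k, replace=False)]\n",
"    for _ in range(n_iter):\n",
"        # Assign every point to its closest center\n",
"        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)\n",
"        labels = dists.argmin(axis=1)\n",
"        # Move each center to the mean of its points (keep it if it has none)\n",
"        new_centers = np.array([\n",
"            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]\n",
"            for j in range(k)\n",
"        ])\n",
"        shift = np.linalg.norm(new_centers - centers)\n",
"        centers = new_centers\n",
"        # Stop once the centers have essentially stopped moving\n",
"        if shift < tol:\n",
"            break\n",
"    return labels, centers\n",
"\n",
"# Two obvious groups of toy points\n",
"pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])\n",
"labels, centers = kmeans_sketch(pts, k=2)\n",
"```\n",
"\n", "The cluster numbers themselves are arbitrary, so what matters is which points share a label, not which label they get."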
] }, { "cell_type": "code", "collapsed": false, "input": [ "num_clusters = 5\n", "from sklearn.cluster import KMeans\n", "\n", "km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)\n", "km.fit(vecData)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Initialization complete\n", "Iteration 0, inertia 6434.212\n", "Iteration 1, inertia 3302.138\n", "Iteration 2, inertia 3286.234\n", "Iteration 3, inertia 3278.006\n", "Iteration 4, inertia 3274.039" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 5, inertia 3271.234\n", "Iteration 6, inertia 3268.856\n", "Iteration 7, inertia 3267.609\n", "Iteration 8, inertia 3266.964\n", "Iteration 9, inertia 3266.352" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 10, inertia 3265.901\n", "Iteration 11, inertia 3265.509\n", "Iteration 12, inertia 3264.970\n", "Iteration 13, inertia 3263.969\n", "Iteration 14, inertia 3261.887" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 15, inertia 3259.657\n", "Iteration 16, inertia 3258.196\n", "Iteration 17, inertia 3257.560\n", "Iteration 18, inertia 3256.997\n", "Iteration 19, inertia 3256.714" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 20, inertia 3256.482\n", "Iteration 21, inertia 3256.326\n", "Iteration 22, inertia 3256.126\n", "Iteration 23, inertia 3255.998\n", "Iteration 24, inertia 3255.918" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 25, inertia 3255.870\n", "Iteration 26, inertia 3255.826\n", "Iteration 27, inertia 3255.768\n", "Iteration 28, inertia 3255.658\n", "Iteration 29, inertia 3255.574" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 30, inertia 3255.550\n", "Iteration 31, inertia 3255.533\n", "Iteration 32, inertia 3255.527\n", "Iteration 33, inertia 3255.522\n", "Iteration 34, inertia 3255.513" 
] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Iteration 35, inertia 3255.508\n", "Iteration 36, inertia 3255.503\n", "Converged at iteration 36\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 110, "text": [ "KMeans(copy_x=True, init='random', max_iter=300, n_clusters=5, n_init=1,\n", " n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n", " verbose=1)" ] } ], "prompt_number": 110 }, { "cell_type": "markdown", "metadata": {}, "source": [ "After fitting, we can get the clustering information out of the `labels_` attribute, and cluster centers from `cluster_centers_`. We then measure the completeness score, which ranges from 0 to 1 and reaches 1 when all posts from the same newsgroup end up in the same cluster." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import metrics\n", "\n", "metrics.completeness_score(train_data.target, km.labels_)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 111, "text": [ "0.40904043798434664" ] } ], "prompt_number": 111 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A completeness of about 0.41 isn't the best, but this could be because, although there are five different topics, their contents are related. We could test several `k` values and compare the scores. Remember though that these results are still `in sample`, and are probably better than we can expect on new data. " ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Solve a real problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are at the stage where we can recommend similar articles to the user. This could be implemented as part of the search algorithm, or simply as recommended posts to read after the current page.\n", "\n", "We first need to vectorize the new post before we predict its label." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "new_post = '''hard drives can fail at any time,\n", " it is important to always backup your data.'''\n", " \n", "new_post_vec = vec.transform([new_post])\n", "new_post_label = km.predict(new_post_vec)[0] # predict the class it belongs to" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 114 }, { "cell_type": "code", "collapsed": false, "input": [ "# Select all posts with the same cluster label as the new post vector\n", "similar_label = (km.labels_ == new_post_label).nonzero()[0]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 115 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, among the posts in the same cluster, we build a list of similarity scores, as we did in the earlier examples." ] }, { "cell_type": "code", "collapsed": false, "input": [ "similar = []\n", "for i in similar_label:\n", "    dist = sp.linalg.norm((new_post_vec - vecData[i].toarray()))\n", "    similar.append((dist, train_data.target[i], train_data.data[i]))\n", "similar = sorted(similar)\n", "print(len(similar))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "175\n" ] } ], "prompt_number": 116 }, { "cell_type": "code", "collapsed": false, "input": [ "# Present the most similar posts\n", "print similar[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(1.1757159813728066, 2, 'From: gjp@sei.cmu.edu (George Pandelios)\\nSubject: Help me select a Backup Solution\\n\\n\\nHi Netters!\\n\\nI\\'m looking at purchasing some sort of backup solution. After you read about\\nmy situation, I\\'d like your opinion. Here\\'s the scenario:\\n\\n1. There are two computers in the house. One is a small 286 (40MB IDE drive).\\n The other is a 386DX (213 SCSI drive w/ Adaptec 1522 controller). 
Both \\n systems have PC TOOLS and will use Central Point Backup as the backup / \\n restore program. Both systems have 3.5\" and 5.25\" floppies.\\n\\n2. The computers are not networked (nor will they be anytime soon).\\n\\nFrom what I have seen so far, there appear to be at least 4 possible\\nsolutions (I\\'m sure there are others I haven\\'t thought about). For these \\noptions, I would appreciate hearing from anyone who has tried them or sees \\nany flaws (drive type X won\\'t coexist with device Y, etc.) in my thinking \\n(I don\\'t know very much about these beasts):\\n\\n1. Put 2.88MB floppy drives (or a combination drive) on each system.\\n Can someone supply cost and brand information? What\\'s a good brand?\\n What do the floppies themselves cost?\\n\\n\\n2. Put an internal tape backup unit on the 386 using my SCSI adapter, and\\n continue to back up the 286 with floppies. Again, can someone recommend a\\n few manufacturers? The only brand I remember is Colorado Memories. Any\\n happy or unhappy users (I know about the compression controversy)?\\n \\n\\n3. Connect an external tape backup unit on the 386 using my SCSI adapter, and\\n (maybe?) connect it to the 286 somehow (any suggestions?)\\n\\n\\n4. Install a Floptical drive in each machine. Again, any gotcha\\'s or \\n recommendations for manufacturers? \\n\\nI appreciate your help. You may either post or send me e-mail. I will\\nsummarize all responses for the net.\\n\\nThanks,\\n\\nGeorge\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\n George J. 
Pandelios\\t\\t\\t\\tInternet: gjp@sei.cmu.edu\\n Software Engineering Institute\\t\\tusenet:\\t sei!gjp\\n 4500 Fifth Avenue\\t\\t\\t\\tVoice:\\t (412) 268-7186\\n Pittsburgh, PA 15213\\t\\t\\t\\tFAX:\\t (412) 268-5758\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\nDisclaimer: These opinions are my own and do not reflect those of the\\n\\t Software Engineering Institute, its sponsors, customers, \\n\\t clients, affiliates, or Carnegie Mellon University. In fact,\\n\\t any resemblence of these opinions to any individual, living\\n\\t or dead, fictional or real, is purely coincidental. So there.\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\n')\n" ] } ], "prompt_number": 117 }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "\n", "\n", "def css_styling():\n", " styles = open(\"/users/ryankelly/desktop/custom_notebook.css\", \"r\").read()\n", "\n", " return HTML(styles)\n", "css_styling()\n" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", " \n", "\n", "\n", "\n" ], "metadata": {}, "output_type": "pyout", "prompt_number": 122, "text": [ "" ] } ], "prompt_number": 122 }, { "cell_type": "code", "collapsed": false, "input": [ "def social():\n", " code = \"\"\"\n", " Tweet\n", "\n", " Follow @Ryanmdk\n", "\n", "